CN105634813A - Method for automatically switching nodes under dual-computer environment based on network - Google Patents
- Publication number
- CN105634813A (application CN201610000774.3A)
- Authority
- CN
- China
- Prior art keywords
- node
- network
- mds
- mdt
- ost
- Legal status
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0668—Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a method for automatically switching nodes in a networked dual-computer environment. It addresses the problem of how to prevent the downtime of a single metadata server (a single point of failure) from making the whole Lustre file system unavailable. The technical scheme is as follows: the management node, the standby management node and the login node are each connected to the mdt node and the ost node through the Ethernet switch, and the storage server is connected through the Ethernet switch to the management node, the standby management node, the login node, the mdt node and the ost node. The method comprises: (1) deploying the heartbeat service on all mds nodes and oss nodes; (2) modifying the ha.cf file according to the actual environment of the cluster; (3) starting the heartbeat service and checking whether all IO nodes are running it; (4) manually bringing down the Ethernet port of the MDS node and observing the switchover process; (5) confirming the remaining recovery time, and confirming that the Lustre partition is still normal after time_remaining has counted down.
Description
Technical field
The present invention relates to a method for automatically switching nodes, and specifically to a method for automatically switching nodes in a networked dual-computer environment.
Background art
In the HPC (high-performance computing) field today, rapidly growing data volumes and increasingly demanding computing tasks keep raising the requirements on I/O bandwidth, which the NFS file system can no longer meet. Lustre, an open-source parallel file system, is widely used in HPC cluster environments because of its strong scalability.
However, as the capacity and bandwidth of the Lustre file system grow to satisfy the I/O bandwidth demands of computation, the load on the Lustre servers also increases, especially on the metadata server (MDS). As the node that stores Lustre metadata, the MDS bears the heaviest load and has a correspondingly higher failure rate. High availability is essential to a cluster: it keeps the cluster's hardware architecture stable, reduces the occurrence of faults, and ensures the stability of the file system. A failure of the cluster file system is catastrophic for the cluster, interrupting the production environment and possibly causing data loss.
Summary of the invention
The technical task of the present invention is to provide a method for automatically switching nodes in a networked dual-computer environment, solving the problem of how to prevent the downtime of a single metadata server from making the whole Lustre file system unavailable.
The technical task of the present invention is achieved as follows:
A method for automatically switching nodes in a networked dual-computer environment, in which the hardware involved includes a storage server, an InfiniBand switch, an Ethernet switch, a management node, a standby management node, a login node, mds nodes and oss nodes. The management node, the standby management node and the login node are each connected to the mdt node and the ost node through the Ethernet switch, and the storage server is connected through the Ethernet switch to the management node, the standby management node, the login node, the mdt node and the ost node. The method comprises the following steps:
(1) deploying the heartbeat service on all mds nodes and oss nodes;
(2) modifying the ha.cf file according to the actual environment of the cluster;
(3) starting the heartbeat service, and checking whether all I/O nodes are running the service;
(4) without unmounting the Lustre partition, manually bringing down the Ethernet interface of the MDS node, and observing the switchover process;
(5) confirming the remaining recovery time, and after time_remaining has counted down, confirming that the Lustre partition is still normal.
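Steps (3) and (5) are checks an operator otherwise repeats by hand. The Python sketch below shows one possible way to script them and is illustrative only: how node status and time_remaining are actually collected (SSH, a service manager, reading the recovery_status file) is left to hypothetical callbacks, and the function names are assumptions rather than part of the described method.

```python
import time
from typing import Callable, Dict, List, Optional


def nodes_without_heartbeat(status: Dict[str, bool]) -> List[str]:
    """Step (3): return the I/O nodes whose heartbeat service is not running.

    `status` maps node name -> True if heartbeat is running there; how this
    mapping is gathered is outside the scope of this sketch.
    """
    return sorted(node for node, running in status.items() if not running)


def wait_for_recovery(
    read_time_remaining: Callable[[], Optional[int]],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Step (5): poll until time_remaining reaches zero or is no longer reported.

    `read_time_remaining` is a hypothetical callback returning the current
    time_remaining in seconds, or None once recovery is no longer in progress.
    Returns True if recovery ended within `timeout` seconds, otherwise False.
    """
    waited = 0.0
    while waited <= timeout:
        remaining = read_time_remaining()
        if remaining is None or remaining <= 0:
            return True
        sleep(poll_interval)
        waited += poll_interval
    return False


if __name__ == "__main__":
    status = {"mds01": True, "mds02": True, "oss01": False}
    print(nodes_without_heartbeat(status))  # ['oss01']

    samples = iter([90, 60, 30, None])  # simulated readings, not real output
    print(wait_for_recovery(lambda: next(samples), sleep=lambda s: None))  # True
```

The `sleep` parameter is injectable so the loop can be exercised without waiting. The final part of step (5), confirming that the Lustre partition is still normal, is not modelled here.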
The mds nodes include an MDS01 node and an MDS02 node; MDS01 is the primary mdt node and MDS02 is the standby mdt node.
The oss nodes include OSS01, OSS02, OSS03 and OSS04, all of which are ost mount nodes.
OSS01 mounts ost00 and ost01; OSS02 mounts ost02 and ost03; OSS03 mounts ost04 and ost05; OSS04 mounts ost06 and ost07.
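The node layout above can be written down as a small data structure, which makes the effect of an OSS takeover easy to check. In this Python sketch the MDT roles and the OST placement come from the text; the OSS failover pairs (OSS01 with OSS02, OSS03 with OSS04) are a hypothetical assumption, because the text only says that mutually redundant OSS nodes take over each other's OSTs without naming the pairs.

```python
from typing import Dict, List

# OST placement as stated in the text.
OST_LAYOUT: Dict[str, List[str]] = {
    "OSS01": ["ost00", "ost01"],
    "OSS02": ["ost02", "ost03"],
    "OSS03": ["ost04", "ost05"],
    "OSS04": ["ost06", "ost07"],
}

# MDT roles as stated in the text.
MDT_ROLES: Dict[str, str] = {"primary": "MDS01", "standby": "MDS02"}

# HYPOTHETICAL pairing: the text does not say which OSS nodes back each other up.
OSS_PARTNER: Dict[str, str] = {
    "OSS01": "OSS02", "OSS02": "OSS01",
    "OSS03": "OSS04", "OSS04": "OSS03",
}


def take_over(layout: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new layout in which the partner of `failed` mounts its OSTs.

    The input layout is copied, so it is left unmodified.
    """
    partner = OSS_PARTNER[failed]
    new_layout = {node: list(osts) for node, osts in layout.items()}
    moved = new_layout[failed]
    new_layout[failed] = []
    new_layout[partner] = new_layout[partner] + moved
    return new_layout


if __name__ == "__main__":
    after = take_over(OST_LAYOUT, "OSS01")
    print(after["OSS02"])  # ['ost02', 'ost03', 'ost00', 'ost01']
    print(after["OSS01"])  # []
```

Returning a new mapping instead of mutating the original keeps the configured layout available for comparison once the failed node is repaired.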
The method for automatically switching nodes in a networked dual-computer environment of the present invention has the following advantages:
1. The method monitors the network heartbeat in real time. In the dual-computer environment, when the primary node fails, the heartbeat mechanism notifies the standby node of the failure, and the standby node automatically takes over the primary node's services or the mounting of its storage space. Scripts are written and deployed on the MDS and OSS nodes of the Lustre file system so that the services of both kinds of node are redundant, which keeps the mdt running without interruption and ensures the normal operation of the Lustre file system;
2. The deployment is script-based: the related services are installed under a designated directory and check the network environment periodically, consuming only a small amount of system resources. By modifying the scripts, the deployment can be adapted to a variety of HA and cluster environments;
3. Once deployed, the method does not affect storage or file system performance and uses few storage server resources. After the active MDS node switches over, the mdt automatically becomes available again without manual intervention. When a single OSS fails, its redundant partner OSS takes over the failed ost, mounts it automatically and checks its availability; once the check is complete, reads and writes to the original ost resume.
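Advantage 1 describes the core takeover rule: if the owner of a resource stops sending heartbeats, the surviving node takes the resource over. The Python sketch below models that rule under stated assumptions: a single observer sees the last heartbeat time of both nodes, the 30-second deadtime and `auto_failback off` are taken from the ha.cf listing in this description, and the class and method names are hypothetical. It is a simplified model, not how the heartbeat daemon is implemented; the real daemon runs on each node and exchanges messages over the network.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FailoverPair:
    """Simplified single-observer model of a two-node heartbeat pair."""

    primary: str
    standby: str
    deadtime: float = 30.0       # seconds of silence before a node counts as dead
    auto_failback: bool = False  # False mirrors "auto_failback off"
    owner: Optional[str] = None  # node currently holding the resource (e.g. the mdt)

    def __post_init__(self) -> None:
        if self.owner is None:
            self.owner = self.primary

    def alive(self, node: str, now: float, last_heard: Dict[str, float]) -> bool:
        # A node is alive if a heartbeat from it arrived less than `deadtime` ago.
        return now - last_heard.get(node, float("-inf")) < self.deadtime

    def on_tick(self, now: float, last_heard: Dict[str, float]) -> str:
        """Re-evaluate which node owns the resource at time `now`."""
        other = self.standby if self.owner == self.primary else self.primary
        if not self.alive(self.owner, now, last_heard) and self.alive(other, now, last_heard):
            self.owner = other  # takeover by the surviving node
        elif (
            self.auto_failback
            and self.owner != self.primary
            and self.alive(self.primary, now, last_heard)
        ):
            self.owner = self.primary  # move back only when auto_failback is on
        return self.owner


if __name__ == "__main__":
    pair = FailoverPair("mds01", "mds02")
    # mds01's Ethernet link goes down right after t = 10 s.
    print(pair.on_tick(35.0, {"mds01": 10.0, "mds02": 34.0}))  # mds01 (silent 25 s)
    print(pair.on_tick(41.0, {"mds01": 10.0, "mds02": 40.0}))  # mds02 (silent 31 s)
    print(pair.on_tick(50.0, {"mds01": 50.0, "mds02": 50.0}))  # mds02, no failback
```

With `auto_failback` disabled, the resource stays on the standby even after the primary returns, which avoids a second interruption when the repaired node rejoins.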
Description of the drawings
The present invention is further described below with reference to the accompanying drawing.
Figure 1 is a hardware block diagram of the method for automatically switching nodes in a networked dual-computer environment.
Detailed description of the embodiments
The method for automatically switching nodes in a networked dual-computer environment of the present invention is described in detail below with reference to the drawing and a specific embodiment.
Embodiment 1:
In the method for automatically switching nodes in a networked dual-computer environment of the present invention, the hardware involved includes a storage server, an InfiniBand switch, an Ethernet switch, a management node, a standby management node, a login node, mds nodes and oss nodes. The management node, the standby management node and the login node are each connected to the mdt node and the ost node through the Ethernet switch, and the storage server is connected through the Ethernet switch to the management node, the standby management node, the login node, the mdt node and the ost node. The method comprises the following steps:
(1) deploying the heartbeat service on all mds nodes and oss nodes;
(2) modifying the ha.cf file according to the actual environment of the cluster;
(3) starting the heartbeat service, and checking whether all I/O nodes are running the service;
(4) without unmounting the Lustre partition, manually bringing down the Ethernet interface of the MDS node, and observing the switchover process;
(5) confirming the remaining recovery time, and after time_remaining has counted down, confirming that the Lustre partition is still normal.
The mds nodes include an MDS01 node and an MDS02 node; MDS01 is the primary mdt node and MDS02 is the standby mdt node.
The oss nodes include OSS01, OSS02, OSS03 and OSS04, all of which are ost mount nodes.
OSS01 mounts ost00 and ost01; OSS02 mounts ost02 and ost03; OSS03 mounts ost04 and ost05; OSS04 mounts ost06 and ost07.
In step (2), the content of the ha.cf file is:
keepalive 2
deadtime 30
initdead 120
# define different udp port for different pairs
#
udpport 694
bcast eth0
use_logd off
logfile /var/log/ha-log
auto_failback off
#
# you must change here
#
node mds01 mds02
ping 11.11.11.1 11.11.11.2
respawn hacluster /usr/lib64/heartbeat/ipfail
# add stonith
#stonith_host md2 external/rackpdu
#stonith external/rackpdu /etc/ha.d/rackpdu.conf
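The listing above uses heartbeat's line-oriented ha.cf syntax: a directive followed by whitespace-separated arguments, with lines starting with `#` treated as comments. As an illustration, the Python sketch below parses a subset of the listing and derives two timing figures: with keepalive 2 and deadtime 30, about 15 consecutive heartbeats must be missed before the peer is declared dead, and initdead is four times deadtime (the sample ha.cf shipped with heartbeat recommends that initdead be at least twice deadtime). The parser and function names are hypothetical; this is not heartbeat's own parser and it does not validate directive names.

```python
from typing import Dict, List

SAMPLE_HA_CF = """\
keepalive 2
deadtime 30
initdead 120
udpport 694
bcast eth0
auto_failback off
# you must change here
node mds01 mds02
"""


def parse_ha_cf(text: str) -> Dict[str, List[List[str]]]:
    """Parse ha.cf-style text into {directive: [arguments of each occurrence]}.

    Blank lines and lines starting with '#' are skipped. Every occurrence of a
    directive is kept, because directives such as `node` may be repeated.
    """
    directives: Dict[str, List[List[str]]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, *args = line.split()
        directives.setdefault(key, []).append(args)
    return directives


def check_timing(cfg: Dict[str, List[List[str]]]) -> Dict[str, float]:
    """Derive simple timing figures from keepalive, deadtime and initdead."""
    keepalive = float(cfg["keepalive"][0][0])
    deadtime = float(cfg["deadtime"][0][0])
    initdead = float(cfg["initdead"][0][0])
    return {
        # heartbeats that can be missed before the peer is declared dead
        "missed_beats": deadtime // keepalive,
        # ratio of the startup dead time to the normal dead time
        "initdead_ratio": initdead / deadtime,
    }


if __name__ == "__main__":
    cfg = parse_ha_cf(SAMPLE_HA_CF)
    print(cfg["node"])        # [['mds01', 'mds02']]
    print(check_timing(cfg))  # {'missed_beats': 15.0, 'initdead_ratio': 4.0}
```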
In step (4), the switchover process is observed by checking the following files on the MDS node or OSS node:
/proc/fs/lustre/mdt/lustre-MDT0000/recovery_status and
/proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status
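The recovery_status files listed above are plain text. The Python sketch below assumes each line has the form `key: value`, for example a `status` line and, while recovery is in progress, a `time_remaining` line in seconds; that layout, the sample text and the helper names are assumptions to be checked against the Lustre version in use. On newer Lustre releases the same information may be exposed through `lctl get_param` rather than these /proc paths.

```python
from pathlib import Path
from typing import Dict, Optional

MDT_STATUS = Path("/proc/fs/lustre/mdt/lustre-MDT0000/recovery_status")


def parse_recovery_status(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines into a dict (line format is an assumption)."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def time_remaining(fields: Dict[str, str]) -> Optional[int]:
    """Return time_remaining in seconds, or None if it is not reported."""
    value = fields.get("time_remaining")
    return int(value) if value is not None and value.isdigit() else None


def read_status(path: Path = MDT_STATUS) -> Dict[str, str]:
    """Read and parse a recovery_status file on the server node."""
    return parse_recovery_status(path.read_text())


if __name__ == "__main__":
    # Illustrative text only, not output captured from a real server.
    sample = "status: RECOVERING\ntime_remaining: 87\nconnected_clients: 1/2\n"
    fields = parse_recovery_status(sample)
    print(fields["status"], time_remaining(fields))  # RECOVERING 87
```

Splitting each line only at its first colon keeps values such as `1/2` intact.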
With the detailed description above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the specific embodiment described above. On the basis of the disclosed embodiment, those skilled in the art may combine different technical features arbitrarily to realize different technical solutions.
Apart from the technical features described in the specification, all other features are techniques known to those skilled in the art.
Claims (4)
1. A method for automatically switching nodes in a networked dual-computer environment, characterized in that the hardware involved includes a storage server, an InfiniBand switch, an Ethernet switch, a management node, a standby management node, a login node, mds nodes and oss nodes; the management node, the standby management node and the login node are each connected to the mdt node and the ost node through the Ethernet switch, and the storage server is connected through the Ethernet switch to the management node, the standby management node, the login node, the mdt node and the ost node; the method comprises the following steps:
(1) deploying the heartbeat service on all mds nodes and oss nodes;
(2) modifying the ha.cf file according to the actual environment of the cluster;
(3) starting the heartbeat service, and checking whether all I/O nodes are running the service;
(4) manually bringing down the Ethernet interface of the MDS node, and observing the switchover process;
(5) confirming the remaining recovery time, and after time_remaining has counted down, confirming that the Lustre partition is still normal.
2. The method for automatically switching nodes in a networked dual-computer environment according to claim 1, characterized in that the mds nodes include an MDS01 node and an MDS02 node, MDS01 being the primary mdt node and MDS02 the standby mdt node.
3. The method for automatically switching nodes in a networked dual-computer environment according to claim 1, characterized in that the oss nodes include OSS01, OSS02, OSS03 and OSS04, all of which are ost mount nodes.
4. The method for automatically switching nodes in a networked dual-computer environment according to claim 3, characterized in that OSS01 mounts ost00 and ost01; OSS02 mounts ost02 and ost03; OSS03 mounts ost04 and ost05; OSS04 mounts ost06 and ost07.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610000774.3A CN105634813A (en) | 2016-01-04 | 2016-01-04 | Method for automatically switching nodes under dual-computer environment based on network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105634813A true CN105634813A (en) | 2016-06-01 |
Family
ID=56049351
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610000774.3A Pending CN105634813A (en) | 2016-01-04 | 2016-01-04 | Method for automatically switching nodes under dual-computer environment based on network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105634813A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103095837A (en) * | 2013-01-18 | 2013-05-08 | 浪潮电子信息产业股份有限公司 | Method achieving lustre metadata server redundancy |
CN104023061A (en) * | 2014-06-10 | 2014-09-03 | 浪潮电子信息产业股份有限公司 | High availability cluster scheme of OSS (Open Storage service) for LUSTRE |
CN104135513A (en) * | 2014-07-24 | 2014-11-05 | 浪潮集团有限公司 | A method for realizing high availability cluster by PowerPC cloud storage platform using Heartbeat |
CN104679907A (en) * | 2015-03-24 | 2015-06-03 | 新余兴邦信息产业有限公司 | Realization method and system for high-availability and high-performance database cluster |
CN105117300A (en) * | 2015-08-12 | 2015-12-02 | 浪潮(北京)电子信息产业有限公司 | Apparatus for realizing high availability of heartbeat |
Non-Patent Citations (2)
Title |
---|
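张晓波 (Zhang Xiaobo): "基于高性能集群计算的并行文件系统关键技术研究" (Research on key technologies of parallel file systems based on high-performance cluster computing), 《西安电子科技大学硕士学位论文》 (Master's thesis, Xidian University) * |
李江昀等 (Li Jiangyun et al.): "基于Linux平台的过程控制双机热备综合解决方案" (An integrated dual-computer hot-standby solution for process control based on the Linux platform), 《计算机工程与应用》 (Computer Engineering and Applications) * |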
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291390A (en) * | 2017-06-20 | 2017-10-24 | 郑州云海信息技术有限公司 | A kind of data classification storage and device |
CN107291390B (en) * | 2017-06-20 | 2020-05-15 | 苏州浪潮智能科技有限公司 | Data hierarchical storage method and device |
CN109445709A (en) * | 2018-11-05 | 2019-03-08 | 郑州云海信息技术有限公司 | The management method and device of storage resource in virtualization system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11360854B2 (en) | Storage cluster configuration change method, storage cluster, and computer system | |
CN107111457B (en) | Non-disruptive controller replacement in cross-cluster redundancy configuration | |
EP2435916B1 (en) | Cache data processing using cache cluster with configurable modes | |
CN106776121B (en) | Data disaster recovery device, system and method | |
CN106815097A (en) | Database disaster tolerance system and method | |
CN104536971A (en) | High-availability database | |
CN105069160A (en) | Autonomous controllable database based high-availability method and architecture | |
CN103763155A (en) | Multi-service heartbeat monitoring method for distributed type cloud storage system | |
JP2005209201A (en) | Node management in high-availability cluster | |
CN102467508A (en) | Method for providing database service and database system | |
CN104717077B (en) | A kind of method, apparatus and system for managing data center | |
CN111045602A (en) | Cluster system control method and cluster system | |
CN107395771A (en) | Full redundancy balancing the load production process data acquisition system | |
CN111399766B (en) | Data storage method, data reading method, device and system in storage system | |
CN113254275A (en) | MySQL high-availability architecture method based on distributed block device | |
CN105634813A (en) | Method for automatically switching nodes under dual-computer environment based on network | |
CN113961397A (en) | High-availability cluster disaster tolerance method based on backup disaster tolerance system | |
CN105323271B (en) | Cloud computing system and processing method and device thereof | |
CN102984009A (en) | Disaster recovery backup method for VoIP (Voice overInternet Protocol) system based on P2P | |
CN105607872A (en) | Storage apparatus | |
CN117421158A (en) | Database fault processing method, system and storage medium | |
CN112231399A (en) | Method and device applied to graph database | |
CN103384267A (en) | Parastor200 parallel storage management node high availability method based on distributed block device | |
CN107819619A (en) | A kind of continual method of access for realizing NFS | |
CN106777238B (en) | A kind of self-adapted tolerance adjusting method of HDFS distributed file system |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160601 |