CN109842505A - Cloud cluster fault processing method and device

Info

Publication number: CN109842505A
Application number: CN201711204632.XA
Authority: CN
Inventors: 牛建华; 吴亮; 赵安安; 孙净亮; 彭朝阳; 曾重阳
Original assignee: CRSC Beijing Urban Transit Technology Co Ltd
Current assignee: CRSC Beijing Urban Transit Technology Co Ltd
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2019-06-04

Abstract

The embodiment of the invention discloses a cloud cluster fault processing method and device. The method includes: if the standby host of the cloud cluster does not receive heartbeat information of a target host within a preset time period, confirming that the target host has failed; taking over the services of the target host so that they continue running while the target host is restarted or repaired offline; and, if heartbeat information of the target host is received, confirming that the target host has recovered and switching each running service back to the target host. By providing a standby host in the cloud cluster, the embodiment of the invention lets the standby host take over the services of a failed target host and switches the services back after the target host recovers. The function of the standby host is simpler than that of an ATS system and switching is more convenient, which improves the operating efficiency of the whole system.

Description

Cloud cluster fault processing method and device

Technical Field

The embodiment of the invention relates to the technical field of rail transit, in particular to a cloud cluster fault processing method and device.

Background

High-availability technology is attracting increasing attention in the field of rail transit. The availability of a rail transit system is measured by the mean time to failure, i.e. how long the computer system operates normally, on average, before a failure occurs.

The equipment centralized station of a current rail transit system is responsible for controlling and dispatching trains, and comprises the ATS (automatic train supervision) extension, the ZC (zone controller), the interlocking and other systems. To improve availability, the ATS, the ZC and the interlocking adopt dual-machine hot standby. The advantage of this scheme is that when one server fails, the system switches immediately to the standby server, so operation is not interrupted by a service failure or system downtime.

However, the ATS is responsible for complex functions such as system scheduling and planning, interfacing with the fare collection system, operating pressure prediction and train allocation, machine vision and machine learning. For example, when several stations on a line have heavy passenger flow, the ATS system can allocate trains from stations with light flow to stations with heavy flow. The ATS deep learning function interfaces with the ticket selling and checking system to learn and predict the passenger flow of the next day or the next several days, and allocates idle trains to heavy-flow stations in advance.

In the existing method, the complex functions of the ATS system in the centralized station lead to low operating efficiency of the system.

Disclosure of Invention

In view of the above problems in the existing method, the embodiment of the invention provides a cloud cluster fault processing method and device.

In a first aspect, an embodiment of the present invention provides a cloud cluster fault handling method, including:

if the standby host of the cloud cluster does not receive heartbeat information of a target host within a preset time period, confirming that the target host fails;

taking over the service of the target host to continue running so as to enable the target host to carry out offline restart or repair;

and if the heartbeat information of the target host is received, confirming that the target host is recovered to be normal, and switching each running service back to the target host to run.

Optionally, before confirming that the target host fails if the standby host of the cloud cluster does not receive heartbeat information of the target host within the preset time period, the method further includes:

the standby host of the cloud cluster receives heartbeat information of the target host through the main heartbeat line, and if the main heartbeat line fails, the standby host receives the heartbeat information of the target host through the standby heartbeat line.

Optionally, the heartbeat message includes host hardware heartbeat information, host network heartbeat information, host operating system heartbeat information, application heartbeat information, and host and disk array connection heartbeat information.

Optionally, the method further comprises:

setting a detection time interval and detection times of heartbeat information in a heartbeat configuration file, and receiving the heartbeat information of a target host according to the heartbeat configuration file.

In a second aspect, an embodiment of the present invention further provides a cloud cluster fault processing apparatus, including:

the fault confirming module is used for confirming that the target host machine is in fault if the heartbeat information of the target host machine is not received within a preset time period;

the service takeover module is used for taking over the service of the target host to continue running so as to enable the target host to be restarted or repaired off line;

and the service recovery module is used for confirming that the target host recovers to be normal and switching each running service back to the target host to run if the heartbeat information of the target host is received.

Optionally, the apparatus further comprises:

and the heartbeat receiving module is used for receiving the heartbeat information of the target host through the main heartbeat line, and if the main heartbeat line fails, the heartbeat information of the target host is received through the standby heartbeat line.

Optionally, the apparatus further comprises:

the file setting module is used for setting the detection time interval and the detection times of the heartbeat information in the heartbeat configuration file and receiving the heartbeat information of the target host according to the heartbeat configuration file.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.

In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.

According to the above technical scheme, a standby host is provided in the cloud cluster; the standby host takes over the services of a failed target host of the cloud cluster so that they continue running, and the services are switched back to the target host after it recovers. The function of the standby host is simpler than that of an ATS (automatic train supervision) system and switching is more convenient, which improves the operating efficiency of the whole system.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a cloud cluster fault handling method according to an embodiment of the present invention;

Fig. 2 is a schematic network connection diagram of a cloud cluster according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a cloud cluster fault handling apparatus according to an embodiment of the present invention;

Fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Fig. 1 shows a schematic flow chart of a cloud cluster fault processing method provided in this embodiment, including:

S101, if the standby host of the cloud cluster does not receive heartbeat information of the target host within a preset time period, determining that the target host fails.

The heartbeat message comprises host hardware heartbeat information, host network heartbeat information, host operating system heartbeat information, application program heartbeat information and host and disk array connection heartbeat information.
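The five parts of the heartbeat message can be pictured as one record with a field per checked item. The following is only a minimal sketch: the class name, the field names and the use of boolean flags are assumptions made for illustration, since this embodiment does not specify how the heartbeat message is encoded.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class HeartbeatMessage:
    """Hypothetical heartbeat record; one flag per item listed above."""
    hardware_ok: bool      # host hardware heartbeat information
    network_ok: bool       # host network heartbeat information
    os_ok: bool            # host operating system heartbeat information
    application_ok: bool   # application program heartbeat information
    disk_array_ok: bool    # host and disk array connection heartbeat information

    def failed_checks(self) -> List[str]:
        """Return the names of the items that report a problem."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def healthy(self) -> bool:
        """True only when every item reports normal operation."""
        return not self.failed_checks()


msg = HeartbeatMessage(True, True, True, False, True)
print(msg.healthy(), msg.failed_checks())
```

Treating a single failed item as an unhealthy host is a design choice of this sketch; the method itself only states that the absence of heartbeat information within the preset time period confirms a failure.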

The cloud cluster comprises a plurality of hosts; each host is a server that runs corresponding services and serves the ATS (automatic train supervision), ZC (zone controller), interlocking and other equipment. Forming the cloud into a cluster of several servers ensures high availability and the safe, stable operation of cloud functions such as the ATS.

The target host is the host that normally runs the services, and the standby host is the host that runs those services when the target host fails.

S102, taking over the service of the target host to continue running so as to enable the target host to be restarted or repaired off line.

Specifically, the standby host of the cloud cluster takes over the service of the target host to continue running, and the service is guaranteed to be uninterrupted. After the standby host takes over, the target host carries out operations such as off-line restarting or repairing and the like so as to recover the service function as soon as possible.

S103, if the heartbeat information of the target host is received, the target host is confirmed to be recovered to be normal, and each running service is switched back to the target host to run.

Specifically, once the target host can serve normally again after being restarted or repaired, it resumes sending heartbeat information to the standby host to indicate that it is functioning normally; after the standby host confirms that the target host has recovered, each running service is switched back to the target host.
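Steps S101 to S103 can be read as a two-state machine on the standby host. The sketch below is one possible reading under stated assumptions: the class `StandbyHost`, its method names and the use of caller-supplied timestamps are hypothetical, and actually starting or stopping services is reduced to appending an entry to a log list.

```python
from enum import Enum
from typing import List


class Role(Enum):
    STANDBY = "standby"  # target host is serving; this host only monitors
    ACTIVE = "active"    # this host has taken over the services


class StandbyHost:
    """Hypothetical standby host that follows S101-S103."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s      # the preset time period
        self.role = Role.STANDBY
        self.last_heartbeat = 0.0       # time of the last heartbeat received
        self.log: List[str] = []

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat; if services were taken over, switch back (S103)."""
        self.last_heartbeat = now
        if self.role is Role.ACTIVE:
            self.role = Role.STANDBY
            self.log.append("switch_back")

    def tick(self, now: float) -> None:
        """Periodic check: confirm failure (S101) and take over (S102)."""
        silent_for = now - self.last_heartbeat
        if self.role is Role.STANDBY and silent_for > self.timeout_s:
            self.role = Role.ACTIVE
            self.log.append("take_over")


host = StandbyHost(timeout_s=30)
host.on_heartbeat(0)    # target host is alive
host.tick(10)           # 10 s of silence: within the preset period
host.tick(31)           # 31 s of silence: failure confirmed, services taken over
host.on_heartbeat(60)   # target host is back: services switched back
print(host.log)
```

Passing the current time in as an argument keeps the sketch deterministic; a real deployment would read a monotonic clock instead.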

The cloud cluster fault processing method provided by this embodiment is suitable for clusters of two or more servers in a rail transit cloud server center. Multiple hosts (servers) work together, each running one or several services; one or more standby hosts are defined for each service, and when a host fails, the services running on it can be taken over by other hosts.

Specifically, the software on each host performs automatic monitoring over the heartbeat lines: the hosts check each other's operating condition through a heartbeat monitoring program, and the checked items are host hardware (CPU and peripherals), host network, host operating system, application programs, and the connection between host and disk array. The system can also switch automatically: if one host confirms that the other has failed, the standby machine takes over the services of the failed host and continues running them. In addition, the system can recover automatically: once the normal host has replaced the failed host, the failed host can be restarted offline. After the restart, the failed host reconnects to the normal host through the heartbeat line, and the services are automatically switched back to it once the repair is finished. The whole recovery process is completed automatically by the heartbeat mechanism.
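The assignment of standby hosts to services can be illustrated with a small lookup table. This is a sketch only: the service names, the host names and the rule of picking the first standby host that has not itself failed are assumptions for illustration and are not specified by this embodiment.

```python
from typing import Dict, List, Optional, Set

# Hypothetical table: for each service, its standby hosts in order of preference.
STANDBYS: Dict[str, List[str]] = {
    "ats": ["host-b", "host-c"],
    "zc": ["host-c"],
}


def pick_takeover_host(service: str, failed: Set[str]) -> Optional[str]:
    """Return the first standby host for the service that has not failed."""
    for candidate in STANDBYS.get(service, []):
        if candidate not in failed:
            return candidate
    return None  # no standby host is available for this service


print(pick_takeover_host("ats", {"host-a"}))
print(pick_takeover_host("ats", {"host-a", "host-b"}))
```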

In this embodiment, a standby host is provided in the cloud cluster; the standby host takes over the services of a failed target host of the cloud cluster so that they continue running, and the services are switched back to the target host after it recovers. The function of the standby host is simpler than that of an ATS (automatic train supervision) system and switching is more convenient, which improves the operating efficiency of the whole system.

Further, on the basis of the above embodiment of the method, before S101, the method further includes:

S100, receiving heartbeat information of a target host by a standby host of the cloud cluster through a main heartbeat line, and if the main heartbeat line fails, receiving the heartbeat information of the target host through a standby heartbeat line.

Specifically, this embodiment includes two heartbeat lines, a main heartbeat line and a standby heartbeat line. The standby heartbeat line guards against failure of the main heartbeat line, so that heartbeats can still be transmitted normally through the standby line.
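The fallback from the main heartbeat line to the standby heartbeat line in S100 can be sketched as follows. Each line is modelled as a callable that returns the heartbeat payload or raises `ConnectionError` when the line has failed; this modelling, and all names in the block, are assumptions made for illustration.

```python
from typing import Callable, Optional

HeartbeatLine = Callable[[], str]  # hypothetical: returns the payload or raises


def receive_heartbeat(main: HeartbeatLine, standby: HeartbeatLine) -> Optional[str]:
    """Try the main heartbeat line first and fall back to the standby line."""
    for line in (main, standby):
        try:
            return line()
        except ConnectionError:
            continue  # this line has failed; try the next one
    return None  # no heartbeat could be received on either line


def broken_line() -> str:
    raise ConnectionError("heartbeat line down")


def working_line() -> str:
    return "alive"


print(receive_heartbeat(broken_line, working_line))
```

A result of `None` means only that no heartbeat arrived in this attempt; whether the target host is considered failed still depends on the preset time period of S101.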

Further, on the basis of the above embodiment of the method, the method further comprises:

S1001, setting a detection time interval and detection times of heartbeat information in a heartbeat configuration file, and receiving the heartbeat information of a target host according to the heartbeat configuration file.

By setting the heartbeat configuration file, a user can conveniently set the detection time interval and the detection times according to different requirements, the safety factor can be conveniently adjusted, and meanwhile, the standby host of the cloud cluster can conveniently read the heartbeat configuration file.
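A heartbeat configuration file of this kind might be read as in the sketch below. The INI layout, the section name and the key names `interval_seconds` and `detection_count` are hypothetical, as is the interpretation that the failure window equals the detection interval multiplied by the number of detections; the embodiment only states that both values are set in the file.

```python
import configparser
from typing import Tuple

# Hypothetical file content; the embodiment does not define a file format.
SAMPLE_CONFIG = """
[heartbeat]
interval_seconds = 2
detection_count = 15
"""


def load_heartbeat_config(text: str) -> Tuple[float, int]:
    """Parse the detection interval and the number of detections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["heartbeat"]
    interval = section.getfloat("interval_seconds")
    count = section.getint("detection_count")
    if interval <= 0 or count <= 0:
        raise ValueError("interval and count must be positive")
    return interval, count


interval, count = load_heartbeat_config(SAMPLE_CONFIG)
failure_window = interval * count  # assumed meaning of the preset time period
print(interval, count, failure_window)
```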

In this embodiment, the servers (hosts) in the cloud cluster automatically monitor each other and exchange data over two heartbeat lines. The cloud system switches over and recovers automatically: after an automatic switchover the failed host can be restarted and recovered automatically, and services are switched back to the original host once the restart succeeds. This improves the operating efficiency of the ATS of the original ground equipment centralized station, ensures continuous and stable operation of the cloud system, and ensures automatic recovery of the cloud cluster.

For example, as shown in Fig. 2, take the minimum cluster consisting of server device A and server device B. The configuration of the primary server records the heartbeat debug information and sets the heartbeat (monitoring) interval to 2 seconds. It specifies that if the standby node does not receive a heartbeat signal from the primary node within 30 seconds, the standby node takes over the resources of the primary server. It also specifies a heartbeat delay time of 10 seconds: if the standby node does not receive a heartbeat signal from the primary node within 10 seconds, a warning is written to the log, but the service is not switched. Steps S101-S103 may be executed once the configuration is complete.

The period reserved after the system is started or restarted, during which missing heartbeats are ignored, takes a value of at least twice the dead time. A UDP port is used for broadcast/unicast communication. Heartbeat detection is sent on the network card eno16777736; UDP multicast on the network card eth0 may be used to organize heartbeats, which is generally done when there is more than one standby node; alternatively, UDP unicast on the network card ens33 may be used, followed by the IP address of the peer. bcast, ucast and mcast denote broadcast, unicast and multicast respectively; they are alternative ways of organizing heartbeats, and any one of them may be selected. It should be noted that network connectivity can be tested by pinging the gateway to detect whether the heartbeat is normal.
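The timing values of this example (a 10 second warning threshold and a 30 second takeover threshold) can be expressed as a simple classification of how long the standby node has gone without a heartbeat. The function names, the returned labels and the choice of inclusive thresholds are assumptions of this sketch.

```python
def classify_silence(silence_s: float, warn_s: float = 10.0, dead_s: float = 30.0) -> str:
    """Map the time since the last heartbeat to the action of the standby node."""
    if silence_s >= dead_s:
        return "takeover"  # standby node takes over the primary's resources
    if silence_s >= warn_s:
        return "warn"      # write a warning to the log, do not switch services
    return "ok"            # heartbeat is arriving as expected


def min_startup_grace(dead_s: float = 30.0) -> float:
    """Smallest allowed ignore period after a start or restart (twice the dead time)."""
    return 2 * dead_s


for silence in (4, 12, 30):
    print(silence, classify_silence(silence))
```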

Fig. 3 shows a schematic structural diagram of a cloud cluster fault processing apparatus provided in this embodiment, where the apparatus includes: a fault confirming module 301, a service takeover module 302, and a service recovery module 303, wherein:

the fault confirming module 301 is configured to confirm that the target host is faulty if heartbeat information of the target host is not received within a preset time period;

the service takeover module 302 is configured to take over the service of the target host to continue running, so that the target host is restarted or repaired offline;

the service recovery module 303 is configured to, if the heartbeat information of the target host is received, confirm that the target host recovers to be normal, and switch each running service back to the target host for running.

Specifically, if the heartbeat information of the target host is not received within a preset time period, the fault confirmation module 301 confirms that the target host is faulty; the service takeover module 302 takes over the service of the target host to continue running, so that the target host is restarted or repaired offline; if the service recovery module 303 receives the heartbeat information of the target host, it determines that the target host recovers to be normal, and switches each running service back to the target host to run.

Further, on the basis of the above embodiment of the apparatus, the apparatus further comprises:

Further, on the basis of the above device embodiment, the heartbeat message includes host hardware heartbeat information, host network heartbeat information, host operating system heartbeat information, application heartbeat information, and host and disk array connection heartbeat information.

The cloud cluster fault processing apparatus described in this embodiment may be configured to execute the method embodiments, and the principle and the technical effect are similar, which are not described herein again.

Referring to Fig. 4, the electronic device includes: a processor 401, a memory 402, and a bus 403;

wherein,

the processor 401 and the memory 402 complete communication with each other through the bus 403;

the processor 401 is configured to call program instructions in the memory 402 to perform the methods provided by the above-described method embodiments.

The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A cloud cluster fault processing method is characterized by comprising the following steps:

2. The method of claim 1, wherein before confirming that the target host fails if the standby host of the cloud cluster does not receive heartbeat information of the target host within the preset time period, the method further comprises:

3. The method of claim 1, wherein the heartbeat messages include host hardware heartbeat information, host network heartbeat information, host operating system heartbeat information, application heartbeat information, and host disk array connectivity heartbeat information.

4. The method according to any one of claims 1-3, further comprising:

5. A cloud cluster fault processing device, characterized by comprising:

6. The apparatus of claim 5, further comprising:

7. The apparatus of claim 5, wherein the heartbeat message comprises host hardware heartbeat information, host network heartbeat information, host operating system heartbeat information, application heartbeat information, and host disk array connectivity heartbeat information.

8. The apparatus of any of claims 5-7, further comprising:

9. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 4.

10. A non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the method according to any one of claims 1 to 4.