CN110784350A - Design method of real-time available cluster management system - Google Patents

Publication number
CN110784350A
Authority
CN
China
Prior art keywords
application
information
node
management
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911022253.8A
Other languages
Chinese (zh)
Other versions
CN110784350B (en)
Inventor
詹少博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN201911022253.8A
Publication of CN110784350A
Application granted
Publication of CN110784350B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a design method for a real-time high-availability cluster management system and belongs to the technical field of high-availability cluster management. The system designed by the invention runs on a real-time operating system, supports visual configuration, and implements resource isolation, dynamic reconfiguration, and application migration. It provides high-availability guarantees for the applications on the computing nodes. It integrates distributed in-memory data management and synchronizes key data through a multi-copy redundancy mechanism. The system decouples software from hardware, improves hardware resource utilization, automatically migrates business applications to available devices when a software or hardware fault occurs, masks faults automatically, and keeps services uninterrupted.

Description

Design method of real-time available cluster management system
Technical Field
The invention relates to the technical field of high-availability cluster management, in particular to a design method of a real-time high-availability cluster management system.
Background
With the development of technology, the traditional management approach of manually checking physical devices and business applications one by one is no longer adequate. Its main shortcomings are as follows:
The number of business applications and physical devices keeps growing, and they can be combined in many ways. Manually recording where each business application is deployed and logging in to physical devices one by one to start and stop a specific business system is therefore inefficient and consumes a great deal of time and effort.
As business applications and physical devices increase, the frequency of software and hardware failures increases linearly. In particular, when a fault occurs in a large system composed of many business applications, troubleshooting and repair can take a long time, which cannot meet the increasingly urgent requirements of scientific research tasks.
Moreover, software and hardware failures not only prevent business applications from running normally but can also cause permanent data loss that cannot be fully recovered, which does not meet real business requirements.
A high-availability cluster management system can solve these problems and has the following characteristics:
1) Supporting decoupling of applications from hardware
Applications are decoupled from hardware in a non-intrusive way, so business applications can migrate across multiple physical devices without affecting the business process.
2) Supporting uninterrupted services
High-availability guarantees are provided for business applications; the business system is protected from software and hardware faults, and faults are masked automatically.
3) Supporting service monitoring and management
Comprehensive visual management is supported, covering application start/stop management, network management, scheduling management, and resource monitoring.
4) Supporting high availability of data
Redundant data backup is supported: data survives automatically when a fault occurs and heals itself after the fault is cleared.
At present, high-availability cluster management systems are deployed on non-real-time systems. They achieve high availability of files through an internally integrated distributed file system and a multi-copy redundancy mechanism, and they provide real-time database synchronization so that key data is damage-resistant and disaster-tolerant. However, file-system-based access is limited in speed and access mode and cannot meet the requirements of a real-time system, and the database itself depends on the file system. High-availability cluster management built for non-real-time systems therefore cannot be used on real-time systems.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to design a real-time high-availability cluster management system.
(II) Technical solution
In order to solve the technical problem, the invention provides a design method of a real-time high-availability cluster management system, which runs on a computing node and a management node and is designed for high-availability management of user applications running on the computing node.
Preferably, the real-time high-availability cluster management system is designed to include a data communication module and an application management module, the data communication module is used for providing FC communication and gigabit ethernet communication data support for the compute nodes and the management nodes, and the application management module is used for performing data distribution management on the compute nodes and the management nodes and also for managing interaction control between the compute nodes and the management nodes.
Preferably, the data communication module is designed to consist of a driver module. The driver module provides an FC driver, a network card driver, and a communication protocol, and it merges the two kinds of communication data by storing them in an in-memory data queue, so that data is sent and received through a unified virtual communication device.
Preferably, the application management module is designed to comprise a data synchronization module, a monitoring module, a loading module, a management module and a human-computer interaction module;
the data synchronization module is designed to provide a real-time data synchronization mechanism: task data generated by the task system is stored in a local database and uploaded to the management node over the network, and the management node distributes it to the computing nodes to achieve real-time data backup; when a fault occurs, the integrated computing combination switches the database instance, and the service route through which tasks access the database, to a backup node in real time; after the fault is cleared, the recovered node is automatically added back to the available sequence and data is backed up to it in real time, so that data synchronization continues without interruption; in addition, both the main application and the standby application can receive external data, which is sent to both at the same time; the main application can back up key data to all computing nodes in order to synchronize control flow and data;
the monitoring module is designed to provide a state monitoring function for the outside, runs on each computing node, communicates with the management node, and is used for periodically acquiring the hardware resource state, the application working state and the module self-checking information of each computing node in the system, and forming the monitoring information into heartbeat messages to be periodically sent to the management node;
the loading module runs on each computing node and is specifically realized by adopting the following design:
a) reading the script configuration file at startup and loading the .out applications;
b) receiving a .out application transmitted by the management node, loading and running it as a process on the CPU core assigned by the management node, storing the application on the electronic disk, and updating the loaded-application information in the configuration file;
c) after the operation is finished, sending a loading-completion message to the management node;
d) receiving the VxWorks mapping file transmitted by the management node and storing it in the boot partition of the electronic disk;
the management module runs on the management node, manages the main application and the standby application through mutual communication with each computing node, and responds to the man-machine interaction information;
the man-machine interaction module is designed to provide a computing node management information display function for a user.
Preferably, the monitoring module is specifically realized by adopting the following design:
a) periodically monitoring the state of each application running on the computing node, forming a heartbeat message, and sending it to the management node;
b) periodically monitoring the presence of the hardware resources used by the computing node and the FC and Ethernet communication states, forming a heartbeat message, and sending it to the management node, wherein the monitoring period can be set in units of 5 milliseconds;
c) receiving a resource monitoring query from the management node, collecting the CPU utilization, CPU temperature, memory capacity, electronic disk capacity, running state of each application, and resource occupancy of each application, and feeding them back to the management node in a message;
d) receiving a self-test result query from the management node and sending the power-on self-test result of the computing node device to the management node;
e) receiving switching information from the management node in real time and switching the standby application to the main application so that it takes over external communication, while the former main application is deleted, re-created, and started as the standby application;
f) providing an API for querying the working state of the currently running application, that is, whether it is the main application or the standby application.
Preferably, the management module is specifically implemented by the following design:
a) module initialization: sending power-on self-test queries to the computing nodes, receiving the self-test information, obtaining the device state of each computing node, raising an alarm for any computing node in a fault state and handling it accordingly, and sending the device state information to the information recording task for recording;
b) human-computer interaction task: receiving human-computer interaction information, including submitted application information, updated mapping information, monitoring information, and the like, and sending it to the information processing task for processing;
c) information processing task: processing the submitted application information; if the application information designates the computing nodes for the main application and the standby application, deploying accordingly; if no computing nodes are designated, sending resource monitoring queries (CPU, memory, electronic disk, and the like) to the computing nodes and, once the information is obtained, selecting the computing nodes with the lowest resource occupancy to run the main application and the standby application; sending the configuration information to the corresponding computing nodes; sending the computing node resource occupancy information and the newly allocated main and standby application information to the information recording task; and processing updated mapping information by sending the mapping file to the computing nodes to be updated;
d) switching processing task: periodically acquiring the heartbeat information of the computing nodes, the main applications, and the standby applications; when no heartbeat is received for more than 2 periods, or the hardware state of a computing node in the heartbeat message is a fault, judging that the computing node has failed, raising an alarm for it, and migrating the applications running on it to computing nodes with sufficient resources according to the current resource occupancy of the remaining computing nodes; when the state of the main application in the heartbeat message is fault or suspended, judging that the main application has failed, sending a switching instruction to the standby application to switch it to the main application, and sending a switching instruction to the node where the failed main application resides so that it is deleted and restarted as the standby application; sending the post-switch computing node information and main and standby application information to the information recording task; the main/standby switching time is one heartbeat period;
e) information recording task: receiving the state information and resource information of the computing nodes and the information about the main and standby applications running on them, and recording this information on the electronic disk to form a log.
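For illustration only, the node selection step of the information processing task could be sketched as follows in Python. This is a minimal sketch and not the patented implementation: the `NodeResources` fields, the unweighted averaging rule in `occupancy`, and the choice to place the main and standby applications on two different nodes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    """Resource report from one computing node (field names are hypothetical)."""
    node_id: str
    cpu_usage: float      # fraction in the range 0..1
    memory_usage: float   # fraction in the range 0..1
    disk_usage: float     # fraction in the range 0..1
    healthy: bool = True


def occupancy(report: NodeResources) -> float:
    # Assumed scoring rule: unweighted mean of the three usage fractions.
    return (report.cpu_usage + report.memory_usage + report.disk_usage) / 3.0


def select_main_and_standby(reports):
    """Return (main_node_id, standby_node_id): the two least-occupied healthy nodes."""
    candidates = sorted((r for r in reports if r.healthy), key=occupancy)
    if len(candidates) < 2:
        raise ValueError("need at least two healthy computing nodes")
    return candidates[0].node_id, candidates[1].node_id
```

A weighted score, or a rule that also counts free CPU cores, would fit the same structure; the text only states that the nodes with the least resource occupation are chosen.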
Preferably, the human-computer interaction module is specifically configured to display, in graphical form, the CPU usage, memory usage, network traffic, and disk usage of each node at each point in time.
Preferably, when the monitoring of the state of each application running on the computing node is performed, the monitoring period is set in units of 5 milliseconds.
Preferably, the running hardware environment resources include an ethernet card, an electronic disk, and an FC.
Preferably, the running states of the applications include normal, failure and suspension.
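As a hedged illustration, the three application states and the 5-millisecond period granularity could be represented as below. The names `AppState`, `validate_period_ms`, and `needs_switchover` are hypothetical; the rule that a main application in the fault or suspended state triggers a switchover is taken from the switching processing task described above.

```python
from enum import Enum

PERIOD_UNIT_MS = 5  # the text states that monitoring periods are set in units of 5 ms


class AppState(Enum):
    NORMAL = "normal"
    FAULT = "fault"
    SUSPENDED = "suspended"


def validate_period_ms(period_ms: int) -> int:
    """Accept only positive multiples of the 5 ms unit."""
    if period_ms <= 0 or period_ms % PERIOD_UNIT_MS != 0:
        raise ValueError(f"period must be a positive multiple of {PERIOD_UNIT_MS} ms")
    return period_ms


def needs_switchover(main_state: AppState) -> bool:
    """A main application reported as fault or suspended is treated as failed."""
    return main_state in (AppState.FAULT, AppState.SUSPENDED)
```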
(III) Advantageous effects
The real-time high-availability cluster management system designed by the invention runs on a real-time operating system, supports visual configuration, and implements resource isolation, dynamic reconfiguration, and application migration. It provides high-availability guarantees for the applications on the computing nodes. It integrates distributed in-memory data management and synchronizes key data through a multi-copy redundancy mechanism. The system decouples software from hardware, improves hardware resource utilization, automatically migrates business applications to available devices when a software or hardware fault occurs, masks faults automatically, and keeps services uninterrupted.
Drawings
FIG. 1 is a diagram of a real-time high availability cluster management system architecture designed by the present invention;
FIG. 2 is a diagram of an operation scenario of a real-time high availability cluster management system designed by the present invention;
FIG. 3 is a data flow diagram of a real-time high availability cluster management system designed by the present invention;
FIG. 4 is a structural diagram of a real-time high-availability cluster management system designed by the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
Application software runs as processes, with one process implementing one application. A computing node supports at most 4 applications, each of which is allocated one CPU core to run on, so that applications are physically isolated from one another in CPU cores and memory.
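The allocation rule in this paragraph (at most 4 applications per node, one CPU core each) can be sketched as simple bookkeeping. The sketch below is an assumption-laden illustration: the `ComputeNode` class and its methods are hypothetical, it only tracks which core belongs to which application, and it does not perform actual core pinning, which would depend on the real-time operating system.

```python
MAX_APPS_PER_NODE = 4  # limit stated in the description


class ComputeNode:
    def __init__(self, node_id, core_count=4):
        self.node_id = node_id
        # Map each core index to the application that owns it (None means free).
        self.core_owner = {core: None for core in range(core_count)}

    def running_apps(self):
        return [app for app in self.core_owner.values() if app is not None]

    def assign(self, app_name):
        """Give the application the lowest-numbered free core and return its index."""
        if len(self.running_apps()) >= MAX_APPS_PER_NODE:
            raise RuntimeError("node already runs the maximum number of applications")
        for core in sorted(self.core_owner):
            if self.core_owner[core] is None:
                self.core_owner[core] = app_name
                return core
        raise RuntimeError("no free CPU core on this node")

    def release(self, app_name):
        """Free the core owned by the application and return its index."""
        for core, owner in self.core_owner.items():
            if owner == app_name:
                self.core_owner[core] = None
                return core
        raise KeyError(app_name)
```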
The management node is independent of the computing nodes. It runs the management software on its own and exchanges information with the software resident on the computing nodes, so it can automatically monitor the state of, and hot-switch, the computing nodes and the applications running on them. Scheduling and switching can be performed statically according to a configuration file. Human-computer interaction is also supported, which gives high flexibility: a user can start and stop applications at run time to migrate them dynamically, and can visually monitor the current state of every node and application. The software resident on a computing node only collects state and executes management commands, so it occupies few resources and is relatively simple in design, leaving most of the node's resources for running applications.
The real-time high-availability cluster management system uses a distributed in-memory data management mechanism to share, store, and back up data. Data is stored redundantly, in multiple copies, on different computing nodes over the network. In its user memory space, each computing node allocates a number of storage regions equal to the number of applications on the computing nodes; the number and capacity of the regions are configurable. Whenever an application's data is updated on a computing node, the data is synchronously backed up to the storage region corresponding to that application on all computing nodes. To keep the data on the computing nodes consistent, the backup to the other computing nodes is completed first, and only then is the data backed up on the local computing node.
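A minimal sketch of the backup ordering described here, assuming an in-process stand-in for the network: each node keeps one region per application, and an update is written to all other nodes before it is written locally. The class `NodeMemory` and the function `replicate_update` are hypothetical names, and real transport, acknowledgement, and failure handling are omitted.

```python
class NodeMemory:
    """User-memory storage of one computing node: one region per application."""

    def __init__(self, node_id, app_names):
        self.node_id = node_id
        self.regions = {name: {} for name in app_names}

    def store(self, app, key, value):
        self.regions[app][key] = value

    def load(self, app, key):
        return self.regions[app][key]


def replicate_update(local, peers, app, key, value):
    """Back up to every other node first, then to the local node.

    Returns the order in which nodes were written, for inspection.
    """
    order = []
    for peer in peers:
        peer.store(app, key, value)   # remote copies are completed first
        order.append(peer.node_id)
    local.store(app, key, value)      # the local copy is written last
    order.append(local.node_id)
    return order
```

After a switchover, the application that takes over would read its latest data from its own region on its node, for example `node.load("nav", "position")` in this sketch.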
Applications on the computing nodes are monitored and switched over when a fault occurs. The application that takes over reads the application's latest data from the corresponding address in the computing node's user space, which keeps the applications synchronized. When an application is migrated dynamically, it likewise reads the latest data from the corresponding address, so the migration is synchronized.
When several applications on a computing node need to access a communication device at the same time, each application is allocated a virtual communication device, and the applications access only their virtual communication devices. For sending, the virtual communication device stores the communication data in an in-memory data queue, and the cluster management system takes the data from the queue and sends it out through the physical communication device. For receiving, the physical communication device stores the received data in the queue, and the cluster management system takes the data out and delivers it to the corresponding virtual communication device, which then receives it.
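The queue-based virtualization described above can be sketched as follows. This is an illustrative model under stated assumptions, not the patented driver: the class `VirtualDeviceMux`, the use of an application identifier to route received data, and the `physical_send` callback standing in for the FC or Ethernet driver are all hypothetical.

```python
from collections import deque


class VirtualDeviceMux:
    """Multiplexes several virtual communication devices onto one physical device."""

    def __init__(self):
        self.tx_queue = deque()   # in-memory queue of (app_id, payload) to transmit
        self.rx_queue = deque()   # in-memory queue of (app_id, payload) received
        self.inboxes = {}         # app_id -> deque of payloads for that virtual device

    def open(self, app_id):
        """Allocate a virtual communication device for one application."""
        self.inboxes[app_id] = deque()

    def send(self, app_id, payload):
        """Called by an application: store outgoing data in the shared queue."""
        self.tx_queue.append((app_id, payload))

    def pump_tx(self, physical_send):
        """Cluster manager side: drain the queue through the physical device."""
        sent = 0
        while self.tx_queue:
            app_id, payload = self.tx_queue.popleft()
            physical_send(app_id, payload)
            sent += 1
        return sent

    def on_physical_receive(self, app_id, payload):
        """Physical device side: store received data in the queue."""
        self.rx_queue.append((app_id, payload))

    def pump_rx(self):
        """Cluster manager side: deliver queued data to the matching virtual device."""
        while self.rx_queue:
            app_id, payload = self.rx_queue.popleft()
            if app_id in self.inboxes:
                self.inboxes[app_id].append(payload)

    def receive(self, app_id):
        """Called by an application: return the next payload, or None if empty."""
        inbox = self.inboxes[app_id]
        return inbox.popleft() if inbox else None
```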
As shown in fig. 1, a real-time high-availability cluster management system architecture diagram is provided, the real-time high-availability cluster management system runs on a computing node and a management node and is used for performing high-availability management on user applications running on the computing node, and the real-time high-availability cluster management system comprises a data communication module and an application management module. The data communication module is used for providing FC communication and gigabit Ethernet communication data support for the computing nodes and the management nodes, the application management module is used for carrying out data distribution management on the computing nodes and the management nodes and also used for managing interaction control between the computing nodes and the management nodes, and the computing nodes and the management nodes jointly form an operating hardware platform of the real-time high-availability cluster management system.
Fig. 2 shows an operation scenario of the real-time high-availability cluster management system. The application management module interacts with user applications through a visual interface and an API, and the data communication module supports two communication modes: the FC network and gigabit Ethernet. The application management module performs data distribution management of the computing nodes by controlling how the data communication module distributes data.
Fig. 3 shows the data flow of the real-time high-availability cluster management system. A user submits an application and configures its resource attributes; the system then automatically allocates a main application and a standby application according to CPU load and synchronizes their data. The main application communicates externally, while the standby application only receives passively. Both the main and standby applications send application heartbeat messages to the management node, each computing node sends a hardware resource heartbeat message to the management node, and the management node monitors the applications and the computing nodes. When the main application is found to be abnormal, the system switches to the standby application, which continues external communication after its data is synchronized, so the application remains highly available. When a computing node is found to be abnormal, the management node raises an alarm, and the applications running on the failed computing node are migrated to other healthy computing nodes by manual operation. Static application deployment is also available through the human-computer interaction interface: high availability is achieved by assigning the main and standby applications to designated computing nodes through configuration.
The API comprises a virtual communication device interface, a data synchronization interface, and an application state monitoring interface. The virtual communication device interface supports network communication for multiple applications on one computing node, the data synchronization interface supports synchronization between applications across computing nodes, and the application state monitoring interface provides the state of the main hardware resources of any computing node as well as the running state of applications.
The data communication module realizes that a plurality of virtual communication devices correspond to one physical communication device through communication device virtualization, so that a plurality of applications can access one network device at the same time; by identifying the master application and the slave application, the master application can send and receive data, and the slave application can only passively receive data.
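One way to picture the role-based restriction is the sketch below, in which external data is delivered to both applications but only the main application may transmit. The class `RoleGatedEndpoint` and the function `fan_out` are hypothetical names introduced for this example, and the role strings "main" and "standby" are an assumed encoding.

```python
class RoleGatedEndpoint:
    """Communication endpoint of one application, gated by its current role."""

    def __init__(self, role):
        if role not in ("main", "standby"):
            raise ValueError("role must be 'main' or 'standby'")
        self.role = role
        self.received = []
        self.sent = []

    def deliver(self, payload):
        # Both main and standby applications receive external data.
        self.received.append(payload)

    def send(self, payload):
        # Only the main application may transmit; standby sends are refused.
        if self.role != "main":
            return False
        self.sent.append(payload)
        return True

    def promote(self):
        # Called on switchover: the standby application becomes the main one.
        self.role = "main"


def fan_out(payload, endpoints):
    """Deliver the same external data to every endpoint."""
    for endpoint in endpoints:
        endpoint.deliver(payload)
```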
FIG. 4 is a block diagram of a real-time high availability cluster management system configuration showing system module configuration, wherein the data communication module is composed of a driver module; the application management module comprises a data synchronization module, a monitoring module, a loading module, a management module and a man-machine interaction module.
The driver module provides an FC driver, a network card driver, and a communication protocol, and it merges the two kinds of communication data by storing them in an in-memory data queue, so that data is sent and received through a unified virtual communication device.
The data synchronization module provides a real-time data synchronization mechanism. Task data generated by the task system is stored in a local database and uploaded to the management node over the network, and the management node distributes it to the computing nodes to achieve real-time data backup. When a fault occurs, the integrated computing combination switches the database instance, and the service route through which tasks access the database, to a backup node in real time. After the fault is cleared, the recovered node is automatically added back to the available sequence and data is backed up to it in real time, so that data synchronization continues without interruption. In addition, both the main application and the standby application can receive external data, which is sent to both at the same time. The main application can back up key data to all computing nodes in order to synchronize control flow and data.
The monitoring module provides a comprehensive state monitoring function to the outside. It runs on each computing node and communicates with the management node; it periodically collects the hardware resource state, application working state, and module self-check information of each computing node in the system, assembles this monitoring information into heartbeat messages, and sends them periodically to the management node. It is specifically implemented as follows:
a) periodically monitoring the state of each application running on the computing node, forming a heartbeat message, and sending it to the management node; the state information is provided by the application and the system, and the monitoring period can be set in units of 5 milliseconds;
b) periodically monitoring the presence of the hardware resources used by the computing node (Ethernet card, electronic disk, FC, etc.) and the FC and Ethernet communication states, forming a heartbeat message, and sending it to the management node; the monitoring period can be set in units of 5 milliseconds;
c) receiving a resource monitoring query from the management node, collecting the CPU utilization, CPU temperature, memory capacity, electronic disk capacity, running state of each application (normal, fault, or suspended), and resource occupancy of each application, and feeding them back to the management node in a message;
d) receiving a self-test result query from the management node and sending the power-on self-test result of the computing node device to the management node;
e) receiving switching information from the management node in real time and switching the standby application to the main application so that it takes over external communication, while the former main application is deleted, re-created, and started as the standby application;
f) providing an API for querying the working state of the currently running application, that is, whether it is the main application or the standby application.
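To make the heartbeat content concrete, the following sketch assembles the monitored items into one message and serializes it. The patent does not specify a message layout, so everything here is an assumption: the `Heartbeat` fields, the JSON encoding (a real-time system would more likely use a compact binary format), and the state and role strings.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Heartbeat:
    """One heartbeat from a computing node (layout is hypothetical)."""
    node_id: str
    sequence: int
    hardware_ok: bool
    app_states: dict = field(default_factory=dict)  # app name -> "normal" | "fault" | "suspended"
    app_roles: dict = field(default_factory=dict)   # app name -> "main" | "standby"


def encode(heartbeat: Heartbeat) -> bytes:
    """Serialize a heartbeat for sending to the management node."""
    return json.dumps(asdict(heartbeat), sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> Heartbeat:
    """Rebuild a heartbeat on the management node."""
    return Heartbeat(**json.loads(raw.decode("utf-8")))
```

The `app_roles` field mirrors the role query in the last list item: the same information an application would obtain through the API to learn whether it is the main or the standby application.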
The loading module runs on each computing node and is specifically realized by adopting the following design:
a) reading the script configuration file at startup and loading the .out applications;
b) receiving a .out application transmitted by the management node, loading and running it as a process on the CPU core assigned by the management node, storing the application on the electronic disk, and updating the loaded-application information in the configuration file;
c) after the operation is finished, sending a loading-completion message to the management node;
d) receiving the VxWorks mapping file transmitted by the management node and storing it in the boot partition of the electronic disk.
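The configuration file handling in these steps could look like the sketch below, which records each loaded application with its assigned CPU core and lists what to load at the next startup. The JSON layout, the function names, and the write-then-rename update (chosen so that a crash mid-write does not leave a half-written file) are assumptions for illustration; the patent only says that the loaded-application information is updated in a configuration file.

```python
import json
import os
import tempfile


def read_config(config_path):
    """Return the loader configuration, or an empty one if the file is missing."""
    try:
        with open(config_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"apps": {}}


def record_loaded_app(config_path, app_name, cpu_core):
    """Add or update one loaded application and rewrite the file atomically."""
    config = read_config(config_path)
    config["apps"][app_name] = {"cpu_core": cpu_core}
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)
    os.replace(tmp_path, config_path)  # swap the new file in as a single step
    return config


def apps_to_load_at_startup(config_path):
    """Return (app_name, cpu_core) pairs ordered by core index."""
    apps = read_config(config_path)["apps"]
    return sorted(((name, entry["cpu_core"]) for name, entry in apps.items()),
                  key=lambda pair: pair[1])
```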
The management module runs on the management node, manages the main and standby applications by communicating with each computing node, and responds to human-computer interaction information. It is specifically implemented with the following design:
f) module initialization: sending power-on self-test monitoring information to the computing nodes, receiving the self-test information, acquiring the equipment state of each computing node, raising an alarm for any computing node in a fault state and handling it accordingly, and sending the equipment state information to the information recording task for recording;
g) human-computer interaction task: receiving human-computer interaction information, including application submission information, mapping update information, monitoring information and the like, and sending it to the information processing task for processing;
h) information processing task: processing the submitted application information. If the application information designates the computing nodes on which the main and standby applications are to run, deployment follows the application information; if no computing nodes are designated, resource monitoring queries (CPU, memory, electronic disk and the like) are sent to the computing nodes and, once the results are obtained, the computing nodes with the least resource occupation are selected as the running nodes of the main and standby applications. The configuration information is then sent to the corresponding computing nodes, and the computing node resource occupation information, together with the newly allocated main and standby application running information, is sent to the information recording task. The task also processes mapping update information by sending the mapping file to the computing node to be updated;
i) switching processing task: periodically acquiring heartbeat information of the computing nodes, the main applications and the standby applications. When no heartbeat is received for more than 2 periods, or the hardware state of a computing node in the heartbeat message is a fault, the computing node is judged to be faulty, an alarm is raised for it, and the applications running on it are migrated to computing nodes with sufficient resources according to the current resource occupation of the remaining computing nodes. When the state of the main application in the heartbeat message is failed or suspended, the main application is judged to be faulty: a switching instruction is sent to the standby application to switch it to the main application, and a switching instruction is sent to the node where the former main application is located so that it is deleted, restarted and becomes the standby application. The switched computing node information, main application information and standby application information are sent to the information recording task. The switching time between the main and standby applications is one heartbeat period;
j) information recording task: receiving the state information and resource information of the computing nodes and the information of the main and standby applications running on them, and recording the information on the electronic disk to form a log.
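The decision rules in items h) and i) can be sketched as three small functions. This is a minimal sketch under stated assumptions rather than the invention's implementation: `NodeStatus`, `select_nodes`, `node_failed` and `main_app_failed` are hypothetical names, CPU, memory and disk occupation are collapsed into a single assumed `load` score, the main and standby copies are assumed to be placed on different nodes, and the 5 ms heartbeat period is only an example value.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Latest view of one computing node held by the management node (hypothetical)."""
    name: str
    load: float               # assumed occupancy score: 0.0 (idle) to 1.0 (full)
    last_heartbeat: float     # time the last heartbeat arrived, in seconds
    hardware_ok: bool = True  # hardware state reported in the heartbeat
    apps: dict = field(default_factory=dict)  # app -> "normal" / "failed" / "suspended"


def select_nodes(nodes):
    """Item h): pick the two least-occupied nodes for the main and standby copies."""
    ranked = sorted(nodes, key=lambda n: n.load)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate nodes")
    return ranked[0].name, ranked[1].name


def node_failed(node, now, period):
    """Item i): faulty if no heartbeat for more than 2 periods or hardware fault."""
    return (now - node.last_heartbeat) > 2 * period or not node.hardware_ok


def main_app_failed(node, app):
    """Item i): the main application is faulty if reported failed or suspended."""
    return node.apps.get(app) in ("failed", "suspended")


PERIOD = 0.005   # example heartbeat period: 5 ms
now = 1.004
nodes = [
    NodeStatus("node-a", load=0.70, last_heartbeat=1.000, apps={"nav": "normal"}),
    NodeStatus("node-b", load=0.20, last_heartbeat=1.000),
    NodeStatus("node-c", load=0.35, last_heartbeat=0.980),  # silent for 24 ms
]
failed = [n.name for n in nodes if node_failed(n, now, PERIOD)]
healthy = [n for n in nodes if n.name not in failed]
placement = select_nodes(healthy)

nodes[0].apps["nav"] = "suspended"   # next heartbeat reports a suspended main app
needs_switch = main_app_failed(nodes[0], "nav")
print(failed, placement, needs_switch)
```

In this sketch a faulty node is excluded before placement, so applications migrated away from it are only assigned to nodes that are still sending heartbeats.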
The human-computer interaction module provides the user with a display of computing node management information. Through its visual interface, the user can see at a glance the running state of each node in the whole system; the module also presents, as graphical data, detailed information such as the CPU usage, memory usage, network traffic and disk usage of each node at each moment, so that the user can readily grasp the overall state of the system.
In summary, the design of high-availability cluster management based on a real-time system provided by the invention realizes high-availability cluster management on a real-time operating system through device virtualization and data synchronization technologies, and performance indexes such as application switching, data migration and fault perception all meet the requirements of a real-time system.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for designing a real-time high-availability cluster management system, which runs on a computing node and a management node and is designed for high-availability management of user applications running on the computing node.
2. The method of claim 1, wherein the real-time high availability cluster management system is designed to include a data communication module and an application management module, the data communication module is used for providing FC communication and gigabit ethernet communication data support for the computing nodes and the management nodes, and the application management module is used for performing data distribution management on the computing nodes and the management nodes and performing management on interaction control between the computing nodes and the management nodes.
3. The method according to claim 2, wherein the data communication module is designed to be composed of a driver module, the driver module provides an FC driver, a network card driver and a communication protocol, and the two kinds of communication data are merged and stored by creating an in-memory data queue, so that data are sent and received through a unified virtual communication device.
4. The method of claim 3, wherein the application management module is designed to include a data synchronization module, a monitoring module, a loading module, a management module, and a human-computer interaction module;
the data synchronization module is designed to provide a real-time data synchronization mechanism: task data generated by a task system is stored in a local database and uploaded to the management node through the network, and the management node distributes it to the computing nodes to realize real-time data backup; when a fault occurs, the integrated computing combination switches the database instance and the service route through which tasks access the database to a backup node in real time; after the fault is removed, the recovered node is automatically added to the available sequence and data are backed up to it in real time, so that normal, uninterrupted data synchronization is finally achieved; meanwhile, both the main application and the standby application can receive external data, and external data are sent to both at the same time; the main application can back up key data to all computing nodes for synchronous control of flow and data;
the monitoring module is designed to provide a state monitoring function externally; it runs on each computing node, communicates with the management node, periodically acquires the hardware resource state, the application working state and the module self-test information of each computing node in the system, and assembles the monitoring information into heartbeat messages that are periodically sent to the management node;
the loading module runs on each computing node and is specifically implemented with the following design:
a) reading the script configuration file information at start-up, and loading the out application;
b) receiving the out application transmitted by the management node, loading and running it as a process, running its tasks on the CPU core assigned by the management node, storing the application on the electronic disk, and updating the loaded-application information in the configuration file;
c) after loading is finished, sending loading completion information to the management node;
d) receiving the VxWorks mapping file transmitted by the management node, and storing it in the boot partition of the electronic disk;
the management module runs on the management node, manages the main application and the standby application through mutual communication with each computing node, and responds to the man-machine interaction information;
the human-computer interaction module is designed to provide a computing node management information display function for a user.
5. The method of claim 4, wherein the monitoring module is specifically implemented with the following design:
a) periodically monitoring the state of each application running on the computing node, forming a heartbeat message and sending the heartbeat message to the management node;
b) periodically monitoring the in-place state of the runtime hardware environment resources of the computing node and the FC and Ethernet communication states, forming a heartbeat message and sending the heartbeat message to the management node, wherein the monitoring period can be set in units of 5 milliseconds;
c) receiving a resource monitoring query sent by the management node, collecting the CPU utilization, CPU temperature, memory capacity, electronic disk capacity, running state of each application and resource occupation of each application, and forming a message to be fed back to the management node;
d) receiving a self-test result query sent by the management node, and returning the power-on self-test result of the computing node equipment to the management node;
e) receiving, in real time, switching information sent by the management node, switching the standby application to become the main application and take over external communication, and deleting the former main application, then re-creating and starting it so that it becomes the standby application;
f) providing an API for querying the working state of the currently running application, namely whether it is the main application or the standby application.
6. The method of claim 5, wherein the management module is specifically implemented with the following design:
a) module initialization: sending power-on self-test monitoring information to the computing nodes, receiving the self-test information, acquiring the equipment state of each computing node, raising an alarm for any computing node in a fault state and handling it accordingly, and sending the equipment state information to the information recording task for recording;
b) human-computer interaction task: receiving human-computer interaction information, including application submission information, mapping update information, monitoring information and the like, and sending it to the information processing task for processing;
c) information processing task: processing the submitted application information, wherein if the application information designates the computing nodes on which the main and standby applications are to run, deployment follows the application information, and if no computing nodes are designated, resource monitoring queries (CPU, memory, electronic disk and the like) are sent to the computing nodes and, once the results are obtained, the computing nodes with the least resource occupation are selected as the running nodes of the main and standby applications; sending the configuration information to the corresponding computing nodes, and sending the computing node resource occupation information, together with the newly allocated main and standby application running information, to the information recording task; and processing mapping update information by sending the mapping file to the computing node to be updated;
d) switching processing task: periodically acquiring heartbeat information of the computing nodes, the main applications and the standby applications; when no heartbeat is received for more than 2 periods, or the hardware state of a computing node in the heartbeat message is a fault, judging that the computing node is faulty, raising an alarm for it, and migrating the applications running on it to computing nodes with sufficient resources according to the current resource occupation of the remaining computing nodes; when the state of the main application in the heartbeat message is failed or suspended, judging that the main application is faulty, sending a switching instruction to the standby application to switch it to the main application, sending a switching instruction to the node where the former main application is located so that it is deleted, restarted and becomes the standby application, and sending the switched computing node information, main application information and standby application information to the information recording task, wherein the switching time between the main and standby applications is one heartbeat period;
e) information recording task: receiving the state information and resource information of the computing nodes and the information of the main and standby applications running on them, and recording the information on the electronic disk to form a log.
7. The method of claim 6, wherein the human-computer interaction module is specifically configured to provide, as graphical data, the CPU usage, memory usage, network traffic information and disk usage of each node at each moment.
8. The method of claim 5, wherein, when the state of each application running on the computing node is monitored, the monitoring period can be set in units of 5 milliseconds.
9. The method of claim 5, wherein the runtime hardware environment resources comprise an Ethernet card, an electronic disk and FC.
10. The method of claim 5, wherein the application running states include normal, failed and suspended.
CN201911022253.8A 2019-10-25 2019-10-25 Design method of real-time high-availability cluster management system Active CN110784350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911022253.8A CN110784350B (en) 2019-10-25 2019-10-25 Design method of real-time high-availability cluster management system

Publications (2)

Publication Number Publication Date
CN110784350A true CN110784350A (en) 2020-02-11
CN110784350B CN110784350B (en) 2022-04-05

Family

ID=69387834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911022253.8A Active CN110784350B (en) 2019-10-25 2019-10-25 Design method of real-time high-availability cluster management system

Country Status (1)

Country Link
CN (1) CN110784350B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof
CN102629906A (en) * 2012-03-30 2012-08-08 浪潮电子信息产业股份有限公司 Design method for improving cluster business availability by using cluster management node as two computers
CN103795801A (en) * 2014-02-12 2014-05-14 浪潮电子信息产业股份有限公司 Metadata group design method based on real-time application group
CN103973811A (en) * 2014-05-23 2014-08-06 浪潮电子信息产业股份有限公司 High-availability cluster management method capable of conducting dynamic migration
KR20150123400A (en) * 2014-04-24 2015-11-04 남서울대학교 산학협력단 A Building Method of High-availability Mechanism of Medical Information Systems based on Clustering Algorism
CN105141456A (en) * 2015-08-25 2015-12-09 山东超越数控电子有限公司 Method for monitoring high-availability cluster resource
CN107832146A (en) * 2017-10-27 2018-03-23 北京计算机技术及应用研究所 Thread pool task processing method in highly available cluster system
CN108763310A (en) * 2018-04-25 2018-11-06 江苏鸣鹤云科技有限公司 A kind of big data platform of High Availabitity
CN110033095A (en) * 2019-03-04 2019-07-19 北京大学 A kind of fault-tolerance approach and system of high-available distributed machine learning Computational frame
CN110134518A (en) * 2019-05-21 2019-08-16 浪潮软件集团有限公司 A kind of method and system improving big data cluster multinode high application availability

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112003721B (en) * 2020-07-15 2022-10-14 苏州浪潮智能科技有限公司 Method and device for realizing high availability of large data platform management node
CN112003721A (en) * 2020-07-15 2020-11-27 苏州浪潮智能科技有限公司 Method and device for realizing high availability of large data platform management node
CN112084135A (en) * 2020-09-18 2020-12-15 西安超越申泰信息科技有限公司 High-reliability computer based on domestic processor
CN112131088A (en) * 2020-09-29 2020-12-25 北京计算机技术及应用研究所 High availability method based on health examination and container
CN112131088B (en) * 2020-09-29 2024-04-09 北京计算机技术及应用研究所 High availability method based on health examination and container
CN112181660A (en) * 2020-10-12 2021-01-05 北京计算机技术及应用研究所 High-availability method based on server cluster
CN112241304B (en) * 2020-10-12 2023-09-26 北京计算机技术及应用研究所 Loongson cluster super-fusion resource scheduling method and device and Loongson cluster
CN112241304A (en) * 2020-10-12 2021-01-19 北京计算机技术及应用研究所 Scheduling method and device for super-converged resources in Loongson cluster and Loongson cluster
CN112477919A (en) * 2020-12-11 2021-03-12 交控科技股份有限公司 Dynamic redundancy backup method and system suitable for train control system platform
CN113377702A (en) * 2021-07-06 2021-09-10 安超云软件有限公司 Method and device for starting two-node cluster, electronic equipment and storage medium
CN113377702B (en) * 2021-07-06 2024-03-22 安超云软件有限公司 Method and device for starting two-node cluster, electronic equipment and storage medium
CN113743965A (en) * 2021-11-08 2021-12-03 中航信移动科技有限公司 Block chain-based civil aviation luggage consignment tracing method and device and electronic equipment
CN114598591A (en) * 2022-03-07 2022-06-07 中国电子科技集团公司第十四研究所 Embedded platform node fault recovery system and method
CN114598591B (en) * 2022-03-07 2024-02-02 中国电子科技集团公司第十四研究所 Embedded platform node fault recovery system and method
CN115904738A (en) * 2023-01-05 2023-04-04 摩尔线程智能科技(北京)有限责任公司 Management system and control method for data processing device cluster

Also Published As

Publication number Publication date
CN110784350B (en) 2022-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant