CN113127270A - Cloud computing-based 2-out-of-3 safety computer platform - Google Patents


Info

Publication number
CN113127270A
Authority
CN
China
Prior art keywords
host
data
synchronization
computer
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110355059.2A
Other languages
Chinese (zh)
Other versions
CN113127270B (en)
Inventor
唐涛
朱力
李松
王悉
王洪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202110355059.2A
Publication of CN113127270A
Application granted
Publication of CN113127270B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1633 Error detection by comparing the output of redundant processing systems using mutual exchange of the output between the redundant processing components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/165 Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1675 Temporal synchronisation or re-synchronisation of redundant processing components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2028 Failover techniques eliminating a faulty processor or activating a spare
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a cloud computing-based 2-out-of-3 safety computer platform. The platform has a layered architecture comprising, from top to bottom, a cloud management center, service nodes, safety computer virtualization containers and physical infrastructure. There is one cloud management center, and the service nodes are hosts. The cloud management center is in signaling and data communication with each of the three hosts; the hosts correspond one-to-one with the safety computer virtualization containers, and the containers correspond one-to-one with the physical infrastructure; each host is in data communication with its corresponding container, and each container is in data communication with its corresponding physical infrastructure. The application and its operating environment are containerized, which makes them lightweight and easy to migrate and deploy; the distributed cloud management center provides real-time monitoring, resource scheduling and platform self-diagnosis for the lower-layer physical service nodes, recovers from faults immediately and inherits historical variable and state data; besides providing the basic functions of a 2-out-of-3 safety computer, the platform also supports the development of peripheral applications.

Description

Cloud computing-based 2-out-of-3 safety computer platform
Technical Field
The invention relates to the technical field of safety computers, and in particular to a 2-out-of-3 safety computer platform based on cloud computing.
Background
Safety computer technology is used in fields such as rail transit and aerospace. It guarantees the correctness of the inputs, outputs and intermediate states of equipment or applications, and mostly adopts multi-modular redundancy.
In the field of rail transit, both ground equipment and on-board equipment are built on safety computers. When equipment fails suddenly for physical or other reasons, a second system or an emergency handling scheme is needed to record the failure state promptly and restore the equipment to a safe condition. In other words, the fail-safe principle must be followed: on failure, the system state is driven to a safe state.
In terms of architecture, a safety computer platform generally adopts a dual-channel structure (2-by-2-out-of-2) or a multi-channel structure (2-out-of-3), in which the channels monitor one another and vote on their respective inputs and outputs to judge whether each channel is normal or abnormal. The architecture mainly comprises three modules: a data communication module, an inter-channel synchronization module and an input/output 2-out-of-3 voting module.
At present, the 2-out-of-3 safety computer platforms of the prior art, in both software and hardware, have the following defects:
1) The redundancy design concept increases the number, and hence the cost, of board cards or hosts.
The SICAS system of Siemens (Germany) and the SelTrac system of a French company are both based on a two-out-of-three structure, and both include safety computers built on the redundancy design concept. Multi-channel redundancy inevitably multiplies the number of board cards or hosts, so that a complete set of safety computer equipment fills one or more cabinets.
2) The board cards are bound to the software, so a failure of either the hardware or the software can cause the safety computer to lose function.
The hardware of a typical 2-out-of-3 safety computer mainly comprises a CPU module, a memory module, a power supply module, peripheral circuits and other modules. A physical failure of any module increases the probability that the safety computer function fails.
3) Maintenance and replacement interrupt application services.
Safety computer platform hardware has a finite mean time to failure, i.e. a limited service life. Once equipment fails or hardware ages, the time needed for maintenance and replacement inevitably puts part of the safety computer out of service, which interrupts the application service.
Disclosure of Invention
Embodiments of the present invention provide a cloud computing-based 2-out-of-3 safety computer platform to overcome the problems of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A cloud computing-based 2-out-of-3 safety computer platform, comprising: a cloud management center, service nodes, safety computer virtualization containers and physical infrastructure, which form a layered architecture and are arranged in that order from top to bottom; there is one cloud management center, the service nodes are hosts, the cloud management center is in signaling and data communication with each of the three hosts, the hosts correspond one-to-one with the safety computer virtualization containers, the safety computer virtualization containers correspond one-to-one with the physical infrastructure, each host is in data communication with its corresponding safety computer virtualization container, and each safety computer virtualization container is in data communication with its corresponding physical infrastructure.
Preferably, the three host structures operate independently of one another, form a loosely coupled redundant structure based on task-level synchronization, and exchange data through virtual network technology; a 2-out-of-3 voting mechanism is adopted among the three host structures, and only the host in the main mode can send information to external equipment.
Preferably, the cloud management center has a distributed structure, which supports geographic disaster recovery and protects against single points of failure, so that monitoring of the service nodes and of the user application processes is not interrupted; and when the communication link between any two hosts is interrupted, data is forwarded through the third host, which ensures that data voting continues to run normally.
Preferably, after the distributed cloud management center and the three service nodes are deployed, the configuration environment and the software body required by the application are packaged into an image through container virtualization technology, the image is deployed on the cloud computing platform, and the application container of the safety computer platform is started from the image; the image can be migrated and started at any time.
Preferably, each host preempts its main/standby priority according to the power-on sequence, and when a host fails or recovers, the main/standby priorities of the three hosts are updated according to the initial state and the identity switching strategy;
the host has the following five working modes:
1) power-on mode: the host is in the power-on start-up stage and, after powering on, sends synchronization requests to the other two hosts; the host that powers on first receives the largest number of synchronization requests and enters the main working mode;
2) main working mode: the host is in a normal working state, its calculation result is consistent with that of at least one other host, and its calculation result is used as the only output of the whole system;
3) standby working mode: the host is in a normal working state and its calculation result is consistent with that of at least one other host, but the host does not output its calculation result externally;
4) following mode: the host has been powered on and restarted after a fault; once the identity switching strategy has been executed, the host enters the following mode, in which it waits for the historical state information sent by the host in the main working mode, completes inheritance learning of the historical data, and then enters the standby working mode;
5) reset mode: when the host fails or its voting result is inconsistent with that of the other two hosts, the host enters the reset mode.
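The five working modes above can be read as a small state machine. The following is a minimal Python sketch under stated assumptions: the event names (`first_to_power_on`, `history_inherited` and so on) and the exact transition set are hypothetical and are inferred from the mode descriptions only; the identity switching flow itself is the one shown in fig. 2 and is not reproduced here.

```python
from enum import Enum


class Mode(Enum):
    POWER_ON = "power-on"
    MAIN = "main"
    STANDBY = "standby"
    FOLLOWING = "following"
    RESET = "reset"


# Hypothetical transition table: (current mode, event) -> next mode.
# The event names are illustrative and do not come from the source text.
TRANSITIONS = {
    (Mode.POWER_ON, "first_to_power_on"): Mode.MAIN,
    (Mode.POWER_ON, "later_to_power_on"): Mode.STANDBY,
    (Mode.POWER_ON, "restarted_after_fault"): Mode.FOLLOWING,
    (Mode.FOLLOWING, "history_inherited"): Mode.STANDBY,
    (Mode.STANDBY, "promoted"): Mode.MAIN,
    (Mode.MAIN, "fault_or_vote_mismatch"): Mode.RESET,
    (Mode.STANDBY, "fault_or_vote_mismatch"): Mode.RESET,
    (Mode.RESET, "restarted"): Mode.POWER_ON,
}


def next_mode(mode: Mode, event: str) -> Mode:
    """Return the next mode; unknown (mode, event) pairs leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)


def may_output(mode: Mode) -> bool:
    """Only the host in the main working mode outputs results externally."""
    return mode is Mode.MAIN
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to check that a standby host can never output, and that a recovered host passes through the following mode before it rejoins as standby.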
Preferably, in the power-on mode, the host follows the synchronization decision logic truth table shown in Table 2:
TABLE 2
Synchronization requests received | Synchronization signals received | Synchronization result
2 | 0 | Synchronization succeeds; the host is the first to power on
1 | 1 | Synchronization succeeds; the host is the second to power on
0 | 1 | Synchronization succeeds; the host is the third to power on
0 | 0 | Synchronization fails
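The decision logic of Table 2 can be written as a direct lookup. This is an illustrative Python sketch; the function name `classify_power_on` and the choice of returning `None` for a failed synchronization are assumptions, not part of the source.

```python
from typing import Optional

# (synchronization requests received, synchronization signals received) -> power-on order
SYNC_TABLE = {
    (2, 0): 1,  # first host to power on (it becomes the main-mode host)
    (1, 1): 2,  # second host to power on
    (0, 1): 3,  # third host to power on
}


def classify_power_on(requests: int, signals: int) -> Optional[int]:
    """Return the host's power-on order per Table 2, or None if synchronization failed.

    Combinations not listed in Table 2, including (0, 0), are treated as failures
    (an assumption for the unlisted cases).
    """
    return SYNC_TABLE.get((requests, signals))
```

For example, `classify_power_on(2, 0)` returns `1`, which identifies the host as the first to power on, while `classify_power_on(0, 0)` returns `None`.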
Preferably, a host enters the power-on mode after starting. In the power-on mode, each host of the 2-out-of-3 safety redundancy system first performs one initial power-on synchronization: each host sends synchronization requests to the other two hosts on start-up and counts the synchronization requests it receives; the identity of each host is then set according to that count, and the host that receives the most synchronization requests becomes the main-mode host;
the main-mode host sends a synchronization signal to the other two hosts and starts the task cycle, and the hosts perform one general task synchronization in every task cycle;
the initial power-on synchronization is also performed once when a fault-recovered host starts, so as to determine the initial identity of each host.
Preferably, when each host exchanges data with the other two hosts, the input data, output data and intermediate state information are voted on, and the voting modes include bit-by-bit voting, selection voting and median voting:
median voting is used when the input data of the hosts are inconsistent, and makes the output data of the hosts consistent; selection voting is used when the data to be compared in the hosts are not completely identical, and each host outputs the consistent data in the intersection of the three hosts; bit-by-bit voting compares the data exchanged between two hosts bit by bit and keeps the data of the two hosts consistent.
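The three voting modes can be illustrated with a short Python sketch. The function names and data types (bytes for bit-by-bit voting, lists of hashable items for selection voting, numbers for median voting) are assumptions chosen for illustration; the source does not specify data formats.

```python
from typing import Hashable, Iterable, List


def bitwise_vote(a: bytes, b: bytes) -> bool:
    """Bit-by-bit voting: two exchanged buffers agree only if every bit matches."""
    if len(a) != len(b):
        return False
    # XOR of equal bytes is zero, so any non-zero result marks a differing bit.
    return all((x ^ y) == 0 for x, y in zip(a, b))


def selection_vote(a: Iterable[Hashable], b: Iterable[Hashable],
                   c: Iterable[Hashable]) -> List[Hashable]:
    """Selection voting: keep only the items present on all three hosts."""
    return sorted(set(a) & set(b) & set(c))


def median_vote(a: float, b: float, c: float) -> float:
    """Median voting: three slightly different inputs yield one agreed value."""
    return sorted((a, b, c))[1]
```

With these definitions, `median_vote(10.0, 10.4, 9.8)` gives `10.0` on every host, and `selection_vote([1, 2, 3], [2, 3, 4], [3, 2, 5])` gives `[2, 3]`, so all three hosts continue with the same data even though their raw inputs differed.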
Preferably, the platform performs fault self-diagnosis with a health check mechanism that periodically checks the running state of the applications inside the platform in TCP, exec or HTTP mode: the TCP and HTTP checks initiate a connection request to verify that the application's IP address and port are open, while the exec check runs a custom diagnostic script that monitors the application state; when the state is abnormal, self-start recovery is triggered and the application is restarted.
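A TCP-mode health check of the kind described above can be sketched in a few lines of Python. This is an illustration only: the helper names `tcp_probe` and `check_and_recover`, the failure threshold and the restart callback are assumptions, and in a real deployment the container platform would normally run the probe rather than application code.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_recover(probe: Callable[[], bool], restart: Callable[[], None],
                      failure_threshold: int = 3, interval: float = 0.0) -> bool:
    """Probe up to failure_threshold times; call restart() if every attempt fails.

    Returns True if the application was found healthy, False if a restart was triggered.
    """
    for _ in range(failure_threshold):
        if probe():
            return True
        time.sleep(interval)
    restart()
    return False
```

A caller would pass something like `lambda: tcp_probe("10.0.0.11", 9000)` as the probe (the address and port here are placeholders) and a function that restarts the application container as the restart callback.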
Preferably, after a faulty host has been repaired and powered on again, a state following mechanism obtains state following data from a normally running host over a socket, and data recovery and inheritance are performed according to that data;
the state following data includes:
1) the timestamp and cycle number of the main-mode host at the moment the historical information is sent;
2) application input data;
3) information related to the communication link management table;
4) application intermediate state data.
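The four items of state following data can be modelled as one snapshot record sent over a socket. The sketch below is a minimal Python illustration; the field names, the JSON encoding and the 4-byte length prefix are all assumptions, since the source only states that the data is transferred in socket mode.

```python
import json
import socket
import struct
from dataclasses import asdict, dataclass


@dataclass
class FollowSnapshot:
    timestamp: float          # main-mode host time when the history is sent
    cycle_number: int         # task cycle number at that moment
    application_input: dict   # application input data
    link_table: dict          # communication link management table information
    intermediate_state: dict  # application intermediate state data


def send_snapshot(sock: socket.socket, snap: FollowSnapshot) -> None:
    """Send the snapshot as length-prefixed JSON (framing is an assumption)."""
    payload = json.dumps(asdict(snap)).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the snapshot was complete")
        buf += chunk
    return buf


def recv_snapshot(sock: socket.socket) -> FollowSnapshot:
    """Receive one snapshot on the recovering host and rebuild the record."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return FollowSnapshot(**json.loads(_recv_exact(sock, length)))
```

The length prefix lets the recovering host know when the whole snapshot has arrived, so it can inherit the state atomically before it moves from the following mode to the standby working mode.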
According to the technical scheme provided by the embodiments of the invention, the application and its running environment are packaged in containers, which makes them lightweight and easy to migrate and deploy; the distributed cloud management center provides real-time monitoring, resource scheduling and platform self-diagnosis for the lower-layer physical service nodes, recovers from faults immediately and inherits historical variable and state data; besides providing the basic 2-out-of-3 safety computer functions, the platform also supports the development of peripheral applications.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of the architecture of a cloud computing-based 2-out-of-3 safety computer platform according to an embodiment of the present invention.
Fig. 2 shows the identity switching process of the 2-out-of-3 safety computer provided by an embodiment of the present invention, triggered when one of the computers fails.
Fig. 3 is a flow chart of the synchronization module according to an embodiment of the present invention, including initial power-on synchronization and general task synchronization.
Fig. 4 is a flow chart of the voting module according to an embodiment of the present invention, including data exchange, synchronous voting and output.
Fig. 5 shows the packaging and start-up process of the 2-out-of-3 safety computer software application according to an embodiment of the present invention, which comprises three steps: packaging an image with Docker container technology, allocating computing, storage and network resources, and starting the container.
Fig. 6 shows the health check and state following execution flow designed for the characteristics of cloud computing according to an embodiment of the present invention, covering two failure situations: a virtual host failure and a service node failure. The overlay network provides each physical node with a virtual subnet that is unique within the cluster and provides routing for the virtual hosts; if a physical node fails, the overlay network maintains and updates the routing table so that the IP address of a virtual host on the failed node is migrated unchanged to a normal physical node.
Fig. 7 shows the execution flow of the state following mechanism according to an embodiment of the present invention, which comprises three steps: following request, identity switching and data inheritance.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
With the development of information technology, cloud computing has emerged as an innovative service model and, by virtue of its very large scale, virtualization, high reliability, generality, high scalability and on-demand service, has become a key information infrastructure supporting the development of many industries. Cloud computing is a development trend of the current era and a future direction for rail transit applications.
In 2019, cloud computing deployments on operating urban rail lines sprang up like bamboo shoots after rain. At present, cities with large operating networks, such as Beijing, Shanghai, Guangzhou, Shenzhen and Wuhan, are deploying and promoting urban rail clouds, and cities with newer metro systems, such as Hohhot and Taiyuan, are building them as well.
In September 2019, the world's first urban rail cloud project covering multiple lines and multiple service systems was established in Hohhot. Its top-level design comprises a production center cloud platform, a disaster recovery center cloud platform and station-level cloud platforms, which provide IaaS (infrastructure as a service) to multiple systems and meet the construction requirements of Hohhot rail transit Lines 1 and 2.
On May 20, 2019, Zhengzhou opened and put into operation China's first line-network-level ANCC cloud platform, which integrates cloud, 5G and Internet of Things technologies as the technical support for a smart metro and deeply integrates the clearing center with the line centers.
These examples show that cloud computing has become a further development direction in the field of rail transit. The present invention therefore migrates the safety computer platform, one of the core components of rail transit, to the cloud through cloud computing technology, and adapts the safety computer platform to the characteristics of cloud computing.
Fig. 1 shows a schematic diagram of the architecture of a cloud computing-based 2-out-of-3 safety computer platform. The safety computer platform has a layered architecture consisting of a distributed cloud management center, service nodes, safety computer virtualization containers and physical infrastructure. There is one cloud management center, and the service nodes are hosts. There are three hosts, three safety computer virtualization containers and three sets of physical infrastructure. The cloud management center is in signaling and data communication with the three hosts (the first host, the second host and the third host); the hosts correspond one-to-one with the safety computer virtualization containers, and the containers correspond one-to-one with the physical infrastructure. Each host is in data communication with its corresponding safety computer virtualization container, which in turn is in data communication with its corresponding physical infrastructure.
In the embodiment of the invention, the software of the 2-out-of-3 safety computer platform is designed to provide a software working platform for safety-critical systems and fulfils the functions of communication, application computation, fault tolerance and safety. The three groups of corresponding service node, safety computer virtualization container and physical infrastructure form three host structures arranged in parallel. The three host structures operate independently to avoid common-mode faults. They form a loosely coupled redundant structure based on task-level synchronization and exchange data through virtual network technology. A 2-out-of-3 voting mechanism is adopted among the three host structures, which ensures the safety, availability and maintainability of the platform. Only the host in the main mode can send information to external devices, which ensures that the output is unique.
The cloud management center has a distributed structure, which supports geographic disaster recovery and protects against single points of failure, so that monitoring of the service nodes and of the user application processes is not interrupted. When the communication link between any two hosts is interrupted, data can be forwarded through the third host, which ensures that data voting continues to run normally.
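The forwarding rule for an interrupted link can be sketched as a simple path choice among the three hosts. The Python below is a hypothetical illustration: the host labels "A", "B" and "C" and the representation of live links as a set of unordered pairs are assumptions, and the actual forwarding is done by the platform's virtual network.

```python
from typing import FrozenSet, List, Optional, Set

HOSTS = {"A", "B", "C"}  # hypothetical labels for the first, second and third host


def route(src: str, dst: str, links_up: Set[FrozenSet[str]]) -> Optional[List[str]]:
    """Return the hop list from src to dst, relaying through the third host if needed.

    Returns None when neither the direct link nor the relay path is available.
    """
    if frozenset((src, dst)) in links_up:
        return [src, dst]
    third = (HOSTS - {src, dst}).pop()
    if frozenset((src, third)) in links_up and frozenset((third, dst)) in links_up:
        return [src, third, dst]
    return None
```

With all links up, `route("A", "B", ...)` yields the direct path; if only the A-B link is down, it yields `["A", "C", "B"]`, so the vote data of A still reaches B within the cycle.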
In the embodiment of the invention, the software of the 2-out-of-3 safety computer platform is packaged with Docker container technology and then deployed on a cloud computing platform, so that it can be scheduled by the cloud platform. To suit the characteristics of cloud computing, namely virtualization (of network, storage and computing resources), high reliability (multi-copy data fault tolerance, homogeneous service nodes and task handover) and scalability (dynamic expansion of the cluster), the embodiment of the invention designs a health check mechanism and a state following mechanism. The health check monitors and safeguards the life cycle of the safety computer platform application, so that the application restarts automatically after an abnormal interruption. However, the application data of the platform is destroyed by the fault restart, so the embodiment of the invention also designs a state following mechanism, by which the restarted host inherits the historical application data from a normally working host; this ensures that the restarted host can come back online immediately and resume service. The 2-out-of-3 safety computer platform of the embodiment therefore not only continues to vote and run normally when any one of the three hosts fails, but also brings the failed host back online quickly and restores the integrity of the platform after the failure, which reduces operation and maintenance work.
The embodiment of the invention designs the 2-out-of-3 safety computer platform on the PaaS cloud platform Kubernetes. After the distributed cloud management nodes (the cloud management center) and the three service nodes are deployed, the configuration environment and the software body required by the application can be packaged into an image through a container virtualization technology such as Docker or LXC (Linux Container), and the image is then deployed on the cloud computing platform. The image can be migrated at any time and the application can be started quickly. The implementation follows this flow: build the cloud computing platform; design the three modules of the 2-out-of-3 safety computer software; package the safety computer platform software into an image with Docker container technology; start the safety computer platform application container from the image; and monitor the platform in real time through health check and state following.
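As an illustration of the deployment step, the Python sketch below builds one Kubernetes Pod manifest per host with a TCP liveness probe, which is one way the health check could be expressed on Kubernetes. The image name, port, labels and probe timings are placeholders, and the source does not disclose its actual manifests; pinning each Pod to a different service node is also omitted here.

```python
import json


def build_pod_manifest(index: int, image: str, port: int) -> dict:
    """Build a Pod manifest for one safety computer container (illustrative values)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"safety-computer-{index}",
            "labels": {"app": "safety-computer"},
        },
        "spec": {
            "restartPolicy": "Always",  # restart the container after abnormal exit
            "containers": [{
                "name": "safety-computer",
                "image": image,
                "ports": [{"containerPort": port}],
                "livenessProbe": {
                    "tcpSocket": {"port": port},  # TCP-mode health check
                    "initialDelaySeconds": 5,
                    "periodSeconds": 2,
                    "failureThreshold": 3,
                },
            }],
        },
    }


# One manifest per host; the image reference is a hypothetical placeholder.
manifests = [
    build_pod_manifest(i, "registry.example/safety-computer:1.0", 9000)
    for i in (1, 2, 3)
]

if __name__ == "__main__":
    print(json.dumps(manifests[0], indent=2))
```

The printed JSON can be saved to a file and applied with the usual Kubernetes tooling; replacing the `tcpSocket` entry with an `httpGet` or `exec` entry would select the HTTP or exec check instead.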
In terms of hardware architecture, the underlying hardware that supports the cloud computing-based safety computer platform (i.e. the hardware on which the cloud computing platform is built) requires at least four physical servers (one cloud management node and three service nodes) and at most six physical servers (three management nodes and three service nodes); the cluster of management nodes forms the cloud management center. The physical configuration of the management nodes and service nodes is shown in Table 1:
Table 1 Physical configuration of management nodes and service nodes
Item | Requirement
Operating system | CentOS 7 x64
CPU | > 2 cores
Memory | > 2 GB
Storage | > 20 GiB
Each host preempts its main/standby priority according to the power-on sequence (start-up sequence), and the main/standby priorities of the three hosts are updated according to the initial state and the identity switching strategy when a host fails or recovers. Following the safety kernel concept, the periodic control of the platform software is divided into several micro-cycles; at the end of each micro-cycle the communication link state is self-diagnosed and reported to the cloud management center through the log system, which shortens the fail-safe response time.
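The priority bookkeeping described above can be sketched as an ordered list in which the head is the main host. This Python sketch rests on assumptions: the class name `PriorityList`, the rule that the next host in power-on order is promoted when the main host fails, and the rule that a recovered host rejoins at the lowest priority are illustrative and are not confirmed by the source, which defers the details to the identity switching strategy of fig. 2.

```python
from typing import Dict, List


class PriorityList:
    """Main/standby priorities ordered by power-on sequence (head = main host)."""

    def __init__(self) -> None:
        self.order: List[str] = []

    def power_on(self, host: str) -> None:
        """A newly started or recovered host joins at the lowest priority (assumption)."""
        if host not in self.order:
            self.order.append(host)

    def fail(self, host: str) -> None:
        """A failed host is removed; the remaining hosts keep their relative order."""
        if host in self.order:
            self.order.remove(host)

    def roles(self) -> Dict[str, str]:
        """Map each active host to 'main' (first in order) or 'standby'."""
        return {h: ("main" if i == 0 else "standby") for i, h in enumerate(self.order)}
```

For instance, after hosts A, B and C power on in that order, A is main; if A fails, B becomes main, and when A recovers it rejoins as a standby behind C.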
Before designing a software module, in order to distinguish a normally-operating host from a host after fault recovery and execute an identity switching strategy, the embodiment of the invention sets up five operating modes as follows:
1) a power-on mode: the host is in a power-on starting stage, the host follows a preemption principle and is shown in a table 2, after being powered on, the host immediately sends synchronization requests to the other two hosts, and the host powered on first receives the largest number of the synchronization requests, namely the host working mode;
2) the main working mode is as follows: when the host computer is in a normal working state, the calculation result of the host computer is at least consistent with the calculation result of one other host computer, and the calculation result of the host computer is used as the only output result of the whole system;
3) standby operation mode: when the host computer is in a normal working state, the calculation result of the host computer is at least consistent with the calculation result of one other host computer, but the host computer does not output the calculation result outwards;
4) following mode: and 3, taking 2, electrifying and starting one host in the safety computer platform again due to faults, and entering a following mode if the execution of the identity strategy is finished. In the following mode, the host computer needs to wait for the historical state information sent by the host computer in the main working state to complete the inheritance learning of the historical data information, and then can enter the standby working mode to operate.
5) Resetting mode: and 3, taking 2, when one host in the safety redundancy system is in failure or the voting result is inconsistent with the other two hosts, the host enters a reset mode.
Based on these five working modes, Fig. 2 shows the identity switching process of the 2-out-of-3 safety computer provided by the embodiment of the invention, which is triggered when one of the computers fails.
A host in the main working mode or the standby working mode can provide normal application processing functions; a host in any other working mode cannot.
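The five working modes can be read as a small state machine. The following is a minimal sketch under stated assumptions: the event names (`won_preemption`, `history_inherited` and so on) and the standby-to-main promotion are hypothetical, since the text describes the transitions only in prose and defers the identity switching details to Fig. 2.

```python
from enum import Enum, auto


class Mode(Enum):
    POWER_ON = auto()
    MAIN = auto()
    STANDBY = auto()
    FOLLOWING = auto()
    RESET = auto()


# Hypothetical event names; only the modes themselves come from the text.
TRANSITIONS = {
    (Mode.POWER_ON, "won_preemption"): Mode.MAIN,
    (Mode.POWER_ON, "lost_preemption"): Mode.STANDBY,
    (Mode.POWER_ON, "restart_after_fault"): Mode.FOLLOWING,
    (Mode.FOLLOWING, "history_inherited"): Mode.STANDBY,
    (Mode.MAIN, "fault_or_vote_mismatch"): Mode.RESET,
    (Mode.STANDBY, "fault_or_vote_mismatch"): Mode.RESET,
    # Assumed: a standby host takes over when the main host drops out.
    (Mode.STANDBY, "promoted"): Mode.MAIN,
    (Mode.RESET, "restarted"): Mode.POWER_ON,
}


def next_mode(mode: Mode, event: str) -> Mode:
    """Return the mode reached from `mode` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(mode, event)]
    except KeyError:
        raise ValueError(f"no transition from {mode.name} on {event!r}") from None


def can_serve(mode: Mode) -> bool:
    """Only main and standby hosts provide normal application processing."""
    return mode in (Mode.MAIN, Mode.STANDBY)
```

For example, a host restarted after a fault passes through `POWER_ON`, then `FOLLOWING`, then `STANDBY`, and `can_serve` becomes true only at the last step.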
Table 2 Truth table of the host power-on synchronization decision logic
Synchronization requests received | Synchronization signals received | Synchronization result
2 | 0 | Synchronization succeeded; this host is the first host powered on
1 | 1 | Synchronization succeeded; this host is the second host powered on
0 | 1 | Synchronization succeeded; this host is the third host powered on
0 | 0 | Synchronization failed
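Table 2 maps directly onto a lookup. The sketch below is illustrative only: the function names are hypothetical, and treating every combination not listed in Table 2 as a synchronization failure is an assumption, since the table lists only (0, 0) as the failure row.

```python
from typing import Optional

# Rows of Table 2: (sync requests received, sync signals received) -> power-on order.
POWER_ON_TABLE = {
    (2, 0): 1,  # first host powered on
    (1, 1): 2,  # second host powered on
    (0, 1): 3,  # third host powered on
}


def power_on_order(requests_received: int, signals_received: int) -> Optional[int]:
    """Return 1, 2 or 3 for a successful synchronization, or None on failure.

    Assumption: any combination absent from Table 2 counts as a failure.
    """
    return POWER_ON_TABLE.get((requests_received, signals_received))


def initial_mode(order: Optional[int]) -> str:
    """The first host powered on becomes main; the other two become standby."""
    if order is None:
        return "sync_failed"
    return "main" if order == 1 else "standby"
```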
In the software module design, the three functional modules of the safety computer platform, namely the data communication module, the synchronization module and the two-out-of-three voting module, are retained in full.
The data communication module differs from plain Ethernet communication in that it uses overlay network technology, in which a virtualized network layer (the overlay network) is superimposed on the physical network architecture. With overlay technology, a new virtual subnet can be added on top of the service node subnet; for example, a 10.244.159.0/36 virtual subnet can be set on the 192.168.1.0/36 physical subnet, which makes the software network environment of the safety computer platform independent and isolated. Under the overlay network, the hosts still interact over a socket communication protocol, and the communication delay is about 0.14 ms. In the invention, the overlay network acts as both a virtual switch and a virtual router: as a virtual switch, it allocates to each physical service node a virtual subnet that is unique within the platform; as a virtual router, the overlay network instances on the service nodes jointly maintain a routing table so that the virtual hosts on different service nodes can reach one another.
Task-level synchronization comprises initial power-on synchronization and general task synchronization. A host enters the power-on mode after startup; in this mode, the three hosts of the 2-out-of-3 safety redundant system perform one round of general task synchronization, which is the initial power-on synchronization, and a host can continue running only after the initial power-on synchronization has completed. General task synchronization is then performed once in every task period to correct the synchronization and clear the accumulated software clock synchronization error.
Fig. 3 shows the execution flow of the synchronization module provided by the embodiment of the invention. The synchronization module covers initial power-on synchronization and general task synchronization. Initial power-on synchronization is performed once when a host starts for the first time or restarts after fault recovery, in order to determine the initial identity of each host; the synchronization information sent comprises a synchronization request and a synchronization pulse signal. On startup, each host sends a synchronization request to the other two hosts and counts the synchronization requests it receives, then sets its identity according to that count; the host that receives the most synchronization requests becomes the main-mode host and is responsible for external output. The main-mode host then sends a synchronization signal to the other two hosts and starts the task cycle. To identify a faulty host that has been powered on again, the synchronization signal frame sent by the main-mode host also carries the identity information of the three hosts. The synchronization is loose synchronization implemented in software; unlike ordinary software synchronization, however, the platform re-corrects the synchronization time against the main-mode host at the end of every synchronization period (general task synchronization), which eliminates the accumulated clock error.
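The effect of the per-cycle correction can be illustrated with a toy model. This is a sketch under assumptions, not the platform's synchronization code: the `SoftClock` class, the 100 ms cycle length and the 0.2 ms per-cycle drift are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SoftClock:
    """A software clock that gains a fixed drift every task cycle (hypothetical model)."""
    drift_ms_per_cycle: float
    now_ms: float = 0.0

    def run_cycle(self, cycle_ms: float) -> None:
        self.now_ms += cycle_ms + self.drift_ms_per_cycle

    def correct_to(self, main_now_ms: float) -> float:
        """General task synchronization: adopt the main host's time, return the error removed."""
        error = self.now_ms - main_now_ms
        self.now_ms = main_now_ms
        return error


def worst_error(cycles: int, correct_each_cycle: bool) -> float:
    """Largest clock difference between a standby host and the main host over `cycles`."""
    main = SoftClock(drift_ms_per_cycle=0.0)
    standby = SoftClock(drift_ms_per_cycle=0.2)
    worst = 0.0
    for _ in range(cycles):
        main.run_cycle(100.0)
        standby.run_cycle(100.0)
        worst = max(worst, abs(standby.now_ms - main.now_ms))
        if correct_each_cycle:
            standby.correct_to(main.now_ms)
    return worst
```

In this model the error stays at one cycle's worth of drift when the standby host is corrected every cycle, and grows linearly with the number of cycles when it is not.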
Fig. 4 is the flow chart of the voting module provided by the embodiment of the invention, comprising data exchange, synchronous voting, and output. The two-out-of-three voting module uses three data comparison algorithms. Unlike the hardware voting of a traditional safety computer, the cloud computing-based 2-out-of-3 safety computer platform uses pure software voting, so the voting module is decoupled from the hardware; each host exchanges data with the other two hosts and votes on the input data, the output data and any other necessary intermediate state information. During voting, the host compares the three data items (its own and those received from the other two hosts) pairwise, and two identical items are selected as the output of the whole system.
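The pairwise comparison can be sketched as follows. The function name and the returned flag are hypothetical; the flag is added only to show how a host could detect that its own result disagrees with the other two, which the text names as a condition for entering the reset mode.

```python
from typing import Optional, Tuple


def vote_2oo3(local: bytes, peer_a: bytes, peer_b: bytes) -> Tuple[Optional[bytes], bool]:
    """Pairwise 2-out-of-3 vote over the local value and the two peer values.

    Returns (agreed value or None, whether the local value is part of the majority).
    """
    if local == peer_a or local == peer_b:
        return local, True
    if peer_a == peer_b:
        # The two peers agree with each other but not with this host.
        return peer_a, False
    # All three values differ: no output can be produced.
    return None, False
```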
The voting modes of the embodiment of the invention are bitwise voting, selective voting and median voting, described in turn below:
1) The input data are not consistent, but the output data must be consistent (median comparison).
For timestamps and random numbers, clock drift and processor randomness mean that the data generated by each machine cannot be guaranteed to be consistent, so such data falls into category 1). Although clock drift exists, its effect can be tolerated within one period, and the data is processed by the median-value method, namely D = (D_1 + D_2 + D_3)/3, so that after comparison all hosts hold consistent data. Depending on the requirements and data characteristics of the actual application, the maximum D = Max(D_1, D_2, D_3), the minimum D = Min(D_1, D_2, D_3), or another algorithm may be used instead.
2) The data to be compared in the three hosts are not completely the same, and the consistent data in their intersection must be output (selective comparison).
Because the three hosts cannot be perfectly synchronized, a host is allowed to receive several frames of data from the same communication object within one period (the frames must differ in age; otherwise they are handled as redundant data). In this case, the data provided to the upper-layer application must be the newest data that can be successfully provided, which ensures that the upper-layer application processes the newest trusted data.
3) The data of the two machines must be strictly consistent, so a bit-by-bit comparison is required (bitwise comparison).
Bit-by-bit comparison means that a result can be output only if the two parties being compared are completely identical. If even one bit of the data differs, the comparison fails and the data cannot be output.
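The three comparison modes can be sketched side by side. Everything named here is an assumption for illustration: the function names, the `(sequence number, payload)` frame representation, and the rule that selective comparison returns the newest frame present on all three hosts. Note that the formula given in the text for the first mode is an arithmetic mean of the three values, so the sketch implements exactly that formula and names it accordingly.

```python
from typing import List, Optional, Tuple

Frame = Tuple[int, bytes]  # (sequence number, payload); hypothetical representation


def combine_drifting(d1: float, d2: float, d3: float, method: str = "mean") -> float:
    """Mode 1: derive one value that all hosts adopt from three differing inputs."""
    if method == "mean":
        return (d1 + d2 + d3) / 3  # the formula given in the text
    if method == "max":
        return max(d1, d2, d3)
    if method == "min":
        return min(d1, d2, d3)
    raise ValueError(f"unknown method {method!r}")


def selective_vote(frames_by_host: List[List[Frame]]) -> Optional[Frame]:
    """Mode 2: newest frame that every host received in this period, if any."""
    common = set(frames_by_host[0])
    for frames in frames_by_host[1:]:
        common &= set(frames)
    if not common:
        return None
    return max(common, key=lambda frame: frame[0])


def bitwise_vote(a: bytes, b: bytes) -> Optional[bytes]:
    """Mode 3: output only when both byte strings are identical in every bit."""
    return a if a == b else None
```

With frames 1 to 3 on the first host, 1 to 2 on the second and 2 to 3 on the third, only frame 2 is common to all hosts, so `selective_vote` returns it even though frame 3 is newer.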
After the three software modules of the 2-out-of-3 safety computer have been designed, the corresponding software is packaged, using Docker container technology, into the image required for deployment. Once the image has been packaged, the 2-out-of-3 safety computer platform of the invention can build the container shown in Fig. 5 by importing the packaged image and attaching the corresponding resources (memory, storage and network resources), and then start the 2-out-of-3 virtual hosts.
With container technology, the invention can bring a virtual host online quickly on any physical node.
Finally, to satisfy the fail-safe principle (that is, faults must lead to a safe state) and to adapt the design to cloud computing technology, the invention provides a health check mechanism and a state following mechanism, which allow the platform to return to a safe state after a host fault and to inherit the variables and state data of the running application. Fig. 6 shows the health check and state following execution flow designed for the characteristics of cloud computing according to an embodiment of the invention, which covers two failure cases: a virtual host failure and a service node failure. The overlay network gives each physical node a virtual subnet that is unique across the whole cluster and provides routing for the virtual hosts; if a physical node fails, the overlay network maintains and updates the routing table so that the IP of a virtual host on the failed node is migrated, unchanged, to a healthy physical node.
The health check mechanism is a form of self-diagnosis: the running state of the applications in the platform is checked periodically by TCP, exec and HTTP probes. The TCP and HTTP probes initiate a connection request and check that the application's IP address and port are open as expected. The exec probe runs a user-defined diagnostic script that monitors the application state; when the state is abnormal, self-start recovery is triggered and the application is restarted. Under this mechanism, the number of running hosts is always kept at three.
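A TCP-style check of this kind can be sketched with the standard library. This is an illustration only, not the platform's probe code: the function names and the host labels are hypothetical, and the restart decision is reduced to listing the hosts whose probe failed.

```python
import socket
from typing import Dict, List


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False


def hosts_to_restart(probe_results: Dict[str, bool]) -> List[str]:
    """Names of hosts whose latest probe failed, so three hosts can be kept running."""
    return sorted(name for name, healthy in probe_results.items() if not healthy)
```

A periodic checker would call `tcp_probe` for each virtual host and pass the collected results to `hosts_to_restart`; an exec-style check would instead run a diagnostic script and interpret its exit status.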
Fig. 7 shows the execution flow of the state following mechanism provided by the embodiment of the invention, which comprises three steps: follow request, identity switching, and data inheritance. The state following mechanism mainly solves the data inheritance problem. The data of the applications running inside the 2-out-of-3 safety computer platform is updated very frequently; if the platform were connected to a database, the frequent interaction between the application and the database could affect the normal operation of the application, which is particularly unfavorable for applications with high precision requirements, heavy resource usage and large volumes of voting data. When a faulty host has been repaired and powered on again, the identity modes of all hosts are updated according to the identity switching strategy, and the restarted host is placed in the following mode. It then waits for state following, obtains the variable data and the running state of the internal application from a normally running host over a socket, and performs data recovery and inheritance. With all host identities recorded, the main-mode host collects the historical application variables and state data and sends them to the restarted host at the end of the current task cycle; when the next cycle begins, the restarted host runs in step with the normally working hosts once general task synchronization has completed. Unlike storing application data in a database, state following therefore needs only one interaction to solve the data inheritance problem.
The state-following data comprises the following:
1) The timestamp and cycle number of the main-mode host at the moment the historical information is sent. After receiving the historical information, the restarted host first performs time correction, that is, it adjusts its own timestamp and cycle number to match those of the main-mode host. This keeps the host's timestamp and cycle number within the allowable range and prevents messages from being wrongly judged invalid because of their timestamp or cycle number.
2) Application input data.
3) Communication link management table related information.
4) Other necessary application intermediate state data.
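The four items and the order of inheritance (time correction first, then data) can be sketched as a single message applied in one step. The class and field names below are hypothetical, and the dictionaries stand in for whatever structures the application actually uses.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FollowData:
    """One state-following message from the main-mode host (hypothetical layout)."""
    main_timestamp_ms: int            # item 1: timestamp at the moment of sending
    main_cycle: int                   # item 1: cycle number at the moment of sending
    app_inputs: Dict[str, Any]        # item 2: application input data
    link_table: Dict[str, Any]        # item 3: communication link management table
    app_state: Dict[str, Any]         # item 4: application intermediate state


@dataclass
class RestartedHost:
    mode: str = "following"
    timestamp_ms: int = 0
    cycle: int = 0
    app_inputs: Dict[str, Any] = field(default_factory=dict)
    link_table: Dict[str, Any] = field(default_factory=dict)
    app_state: Dict[str, Any] = field(default_factory=dict)

    def inherit(self, data: FollowData) -> None:
        # Time correction comes first, so later messages are not rejected as stale.
        self.timestamp_ms = data.main_timestamp_ms
        self.cycle = data.main_cycle
        # Copy the data so the restarted host does not share objects with the message.
        self.app_inputs = dict(data.app_inputs)
        self.link_table = dict(data.link_table)
        self.app_state = dict(data.app_state)
        # After inheritance the host may run as a standby host.
        self.mode = "standby"
```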
In summary, the application and its running environment in the embodiment of the invention are containerized and lightweight, and are easy to migrate and deploy. The platform diagnoses itself, recovers from faults immediately, and inherits historical variables and state data. The distributed cloud management center provides real-time monitoring and resource scheduling of the underlying physical service nodes. The platform offers geographic disaster tolerance and protection against single points of failure: the failure of one machine does not affect the function of the two-out-of-three safety computer, and normal operation is restored within about 3 s. The platform is also extensible: besides the basic functions of a 2-out-of-3 safety computer, peripheral applications can be developed, such as a front end that displays the network traffic, host identity modes, and memory and CPU usage of the safety computer platform.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus or system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant details can be found in the corresponding parts of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A cloud computing-based 2-out-of-3 secure computer platform, comprising: a cloud management center, service nodes, safety computer virtualization containers and physical infrastructures, which form a layered architecture and are arranged in sequence from top to bottom; one cloud management center is provided, the service nodes are hosts, the cloud management center is in signaling and data communication with each of the three hosts, the hosts are in one-to-one correspondence with the safety computer virtualization containers, the safety computer virtualization containers are in one-to-one correspondence with the physical infrastructures, the hosts are in data communication with the corresponding safety computer virtualization containers, and the safety computer virtualization containers are in data communication with the corresponding physical infrastructures.
2. The cloud computing-based secure computer platform of claim 1, wherein the three host computer structures independently operate, a loosely coupled redundant structure is achieved between the three host computer structures based on task level synchronization, and data exchange is performed through a virtual network technology; a voting mechanism of 2 out of 3 is adopted among the three host computer structures, and only the host computer in the main mode can send information to other external equipment.
3. The cloud computing-based secure computer platform of claim 1, wherein the cloud management center is a distributed structure, and is capable of geographically disaster recovery and defending against single-point failures, and monitoring service nodes and user application processes is uninterrupted; and after the communication link between any two hosts is interrupted, the data is forwarded through the third host, so that the normal operation of data voting is ensured.
4. The cloud-computing-based secure computer platform of claim 1, wherein after a distributed cloud management center and three service nodes are deployed, a configuration environment and a software main body required by an application are packaged into a mirror image through a container virtualization technology, the mirror image is deployed on the cloud computing platform, the secure computer platform application container is started through the mirror image, and the mirror image can be migrated and started at any time.
5. The cloud computing-based 2-out-of-3 secure computer platform of any one of claims 1 to 4, wherein each host preempts the primary and secondary priorities according to a power-on sequence, and updates the primary and secondary priorities of three hosts according to an initial state and an identity switching policy during failure and recovery;
the working modes of the host comprise five working modes as follows:
1) a power-on mode: the host is in a power-on starting stage, and sends synchronous requests to the other two hosts after power-on, the host powered on first receives the largest number of synchronous requests, and the host is in a main working mode;
2) the main working mode is as follows: the host computer is in a normal working state, the calculation result of the host computer is at least consistent with the calculation result of one other host computer, and the calculation result of the host computer is used as the only output result of the whole system;
3) standby operation mode: the host computer is in a normal working state, the calculation result of the host computer is at least consistent with the calculation result of one other host computer, but the host computer does not output the calculation result outwards;
4) following mode: the host is powered on again due to faults and started, if the execution of the identity strategy is finished, the host enters a following mode, and under the following mode, the host needs to wait for historical state information sent by the host in a main working state, complete inheritance learning of historical data information and then enter a standby working mode to operate;
5) resetting mode: when the host is in failure or the voting result is inconsistent with the other two machines, the host enters a reset mode.
6. The cloud-computing-based 2-out-of-3 secure computer platform of claim 5, wherein in power-on mode, the synchronization decision logic truth table followed by the host is as shown in Table 2:
TABLE 2
Synchronization requests received | Synchronization signals received | Synchronization result
2 | 0 | Synchronization succeeded; this host is the first host powered on
1 | 1 | Synchronization succeeded; this host is the second host powered on
0 | 1 | Synchronization succeeded; this host is the third host powered on
0 | 0 | Synchronization failed
7. The cloud computing-based secure computer platform of claim 5, wherein the hosts enter a power-on mode after being started, and when the host is in the power-on mode, each host of the secure redundancy system of claim 2 performs initial power-on synchronization once, each host sends synchronization requests to the other two hosts when being started, each host counts the number of the synchronization requests received by itself, switches the identity of the host according to the number of the synchronization requests, and the host receiving the most synchronization requests is the host in the primary mode;
the main mode host sends a synchronization signal to the other two machines, starts a task period, and carries out one-time general task synchronization on each host in each task period;
the initial power-on synchronization is performed once when the fault recovery host is started, so as to determine the initial identity of each host.
8. The cloud computing-based 2-out-of-3 secure computer platform as recited in claim 5, wherein when each host exchanges data with the other two hosts, input and output data and intermediate status information are voted in a manner including bit-by-bit voting, selective voting and median voting:
the median vote is that the input data of each host are inconsistent, and the output data of each host are consistent; the selection voting is that the data to be compared in each host are not completely the same, and each host outputs consistent data in the three-host intersection; the bitwise voting is to compare the two host data for data exchange bit by bit and keep the two host data consistent.
9. The cloud computing-based 2-out-of-3 security computer platform as claimed in claim 5, wherein the platform employs a health check mechanism to perform fault self-diagnosis, performs periodic state check on the running state of applications inside the platform in a TCP, exec or HTTP manner, initiates a link request through TCP and HTTP, checks the normal opening of application IP address + port, executes a custom diagnosis script through exec, monitors the application state and triggers self-start recovery, and restarts recovery when the state is abnormal.
10. The cloud computing-based 2-out-of-3 secure computer platform as claimed in claim 5, wherein after a failed host is maintained and powered up again, a state following mechanism is used to obtain state following data from a normally running host in a socket manner, and data recovery and inheritance are performed according to the state following data;
the state following data includes:
1) the timestamp and cycle number information of the main mode host at the moment of sending the historical information;
2) application input data;
3) communication link management table related information;
4) application intermediate state data.
CN202110355059.2A 2021-04-01 2021-04-01 Cloud computing-based 2-out-of-3 safety computer platform Active CN113127270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355059.2A CN113127270B (en) 2021-04-01 2021-04-01 Cloud computing-based 2-out-of-3 safety computer platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355059.2A CN113127270B (en) 2021-04-01 2021-04-01 Cloud computing-based 2-out-of-3 safety computer platform

Publications (2)

Publication Number Publication Date
CN113127270A true CN113127270A (en) 2021-07-16
CN113127270B CN113127270B (en) 2023-06-27

Family

ID=76774512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355059.2A Active CN113127270B (en) 2021-04-01 2021-04-01 Cloud computing-based 2-out-of-3 safety computer platform

Country Status (1)

Country Link
CN (1) CN113127270B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827148A (en) * 2022-04-28 2022-07-29 北京交通大学 Cloud security computing method and device based on cloud fault-tolerant technology and storage medium
WO2023005777A1 (en) * 2021-07-29 2023-02-02 西门子交通技术(北京)有限公司 2*2oo2 security system based on cloud platform
CN116156860A (en) * 2023-02-22 2023-05-23 北京航天发射技术研究所 Electromagnetic compatibility optimization method for synchronous servo controller of electrically-driven special vehicle
CN116881920A (en) * 2023-06-27 2023-10-13 北京城建智控科技股份有限公司 Safety voting system and method based on code simulator
WO2024082174A1 (en) * 2022-10-19 2024-04-25 宁德时代未来能源(上海)研究院有限公司 Abnormality processing method and two-out-of-three protection device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833314A (en) * 2012-07-27 2012-12-19 合肥华云通信技术有限公司 Cloud public service platform
WO2017049997A1 (en) * 2015-09-25 2017-03-30 华为技术有限公司 Virtual machine monitoring method, apparatus and system based on cloud computing service
CN107247644A (en) * 2017-07-03 2017-10-13 上海航天控制技术研究所 A kind of reconstruct down method of triple redundance computer system
CN110784539A (en) * 2019-10-29 2020-02-11 深圳供电局有限公司 Data management system and method based on cloud computing
CN111541599A (en) * 2020-04-24 2020-08-14 山东山大电力技术股份有限公司 Cluster software system and method based on data bus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任维贺: ""基于私有云的安全计算机关键技术研究"", 《中国优秀硕士学位论文全文数据库工程科技Ⅱ辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023005777A1 (en) * 2021-07-29 2023-02-02 西门子交通技术(北京)有限公司 2*2oo2 security system based on cloud platform
CN114827148A (en) * 2022-04-28 2022-07-29 北京交通大学 Cloud security computing method and device based on cloud fault-tolerant technology and storage medium
CN114827148B (en) * 2022-04-28 2023-01-03 北京交通大学 Cloud security computing method and device based on cloud fault-tolerant technology and storage medium
WO2024082174A1 (en) * 2022-10-19 2024-04-25 宁德时代未来能源(上海)研究院有限公司 Abnormality processing method and two-out-of-three protection device
CN116156860A (en) * 2023-02-22 2023-05-23 北京航天发射技术研究所 Electromagnetic compatibility optimization method for synchronous servo controller of electrically-driven special vehicle
CN116156860B (en) * 2023-02-22 2024-03-08 北京航天发射技术研究所 Electromagnetic compatibility optimization method for synchronous servo controller of electrically-driven special vehicle
CN116881920A (en) * 2023-06-27 2023-10-13 北京城建智控科技股份有限公司 Safety voting system and method based on code simulator
CN116881920B (en) * 2023-06-27 2024-03-26 北京城建智控科技股份有限公司 Safety voting system and method based on code simulator

Also Published As

Publication number Publication date
CN113127270B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant