CN115599609A - Fault-tolerant method for multi-kernel operating system - Google Patents

Publication number
CN115599609A
Authority
CN
China
Prior art keywords
kernel
service process
operating system
address space
shadow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110776463.7A
Other languages
Chinese (zh)
Inventor
张为华
蒋金虎
林玉哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-09
Filing date
2021-07-09
Publication date
2023-01-13
Application filed by Fudan University
Priority to CN202110776463.7A
Publication of CN115599609A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2043 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share a common memory address space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel. The multi-kernel operating system comprises: a first kernel running a service process that has a first user address space; a second kernel running at least a user process; and a memory. The method is characterized by comprising the following steps: Step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space, and the first user address space and the second user address space are mapped to the same memory; Step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way; and Step S3, when the first kernel is detected to have failed, the shadow service process is moved into a work queue, and the user process switches from accessing the service process to accessing the shadow service process.

Description

Fault-tolerant method for multi-kernel operating system
Technical Field
The invention belongs to the technical field of computer operating systems, and particularly relates to a fault tolerance method for a multi-kernel operating system.
Background
An operating system is the core component between computer hardware and applications. At present, the computing performance of a computer's CPU depends mainly on the performance of a single core and on the number of cores on a chip. Because single-core performance is limited by physical device process technology, multi-core systems are used more and more widely, and the number of cores they contain keeps growing. To run operating systems efficiently on multi-core systems and to use their resources flexibly, researchers have designed and implemented multi-kernel operating systems. Andrew Baumann et al. designed the multikernel model, which allows different cores to run different kernel nodes; programs running on the kernel nodes can interact through inter-process communication (see The Multikernel: A new OS architecture for scalable multicore systems, SOSP '09).
In the design of a multi-kernel operating system, when one of the kernels fails or needs to be updated, a fault-tolerance mechanism is required so that programs originally running on that kernel can continue to provide services. Earlier work has studied this to some extent. Wanli et al. invented a failure control method that designs lightweight kernels and achieves fault tolerance by transferring system services from the failed kernel to other normal kernels (see "Failure control method and apparatus of multi-kernel operating system", CN 104657240A). However, in this method, when the heavyweight kernel fails, a new heavyweight kernel must be selected from among the multiple lightweight kernels, the system services must be transferred, and the corresponding kernel state information must be updated in all kernels, so the mechanism is complex. In addition, migrating system services involves copying state information and data, which has some impact on performance.
Disclosure of Invention
In order to solve the above problems, the invention provides a fault tolerance method for a multi-kernel operating system, which adopts the following technical scheme:
The invention provides a fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel of the multi-kernel operating system. The multi-kernel operating system comprises: a first kernel running a service process that has a first user address space; a second kernel running at least a user process; and a memory. The method is characterized by comprising the following steps: Step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space, and the first user address space and the second user address space are mapped to the same memory; Step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way; and Step S3, when the first kernel is detected to have failed, the second kernel moves the shadow service process into a work queue, so that the user process switches from accessing the service process to accessing the shadow service process.
The fault tolerance method for a multi-kernel operating system provided by the invention may further have the technical feature that the multi-kernel operating system is a microkernel operating system.
The fault tolerance method for a multi-kernel operating system provided by the invention may further have the technical feature that the service process also has a first thread control block and the shadow service process also has a second thread control block; the first thread control block is in the work queue of the first kernel, and the second thread control block is in the copy queue of the second kernel.
The fault tolerance method for a multi-kernel operating system provided by the invention may further have the technical feature that, when the first kernel fails, the second kernel moves the second thread control block into the work queue.
The fault tolerance method for a multi-kernel operating system provided by the invention may further have the technical feature that the user process accesses the service process and the shadow service process through inter-process communication.
Action and Effect of the Invention
According to the fault tolerance method for a multi-kernel operating system of the invention, a shadow service process corresponding to the service process is prepared in advance, and the user address spaces of the two processes are mapped to the same memory. When the service process interacts with the kernel on which it runs and modifies the content of its user address space, the shadow service process modifies the content of its own user address space in the same way, so the state information and data of the two processes remain consistent. The shadow service process sits in a copy queue and does not actually run. When the kernel on which the service process runs fails, the shadow service process is placed in a work queue and becomes a normally running task, and the user process can call the shadow service process directly through the original inter-process communication. The method can therefore quickly replace the original service on the failed kernel with the shadow service. The mechanism is simple, occupies few system resources, requires no copying, and has little impact on performance.
Drawings
FIG. 1 is a schematic diagram of a service process and a shadow service process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of multiple user processes using the service process according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the operation of a shadow service process in case of failure according to an embodiment of the present invention.
Detailed Description
In order to make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the fault tolerance method for a multi-kernel operating system of the present invention is described in detail below with reference to the embodiment and the accompanying drawings.
<Example>
In this embodiment, the multi-kernel operating system is seL4, running on a multi-core hardware platform based on the x86 architecture.
FIG. 1 is a schematic diagram of the service process and shadow service process according to an embodiment of the present invention.
As shown in FIG. 1, the multi-kernel operating system includes a first kernel 1, a second kernel 2, and a memory 3. Any two kernels in the multi-kernel operating system can serve as the first kernel 1 and the second kernel 2.
A service process 11 runs on the first kernel 1, and the service process 11 has a first user address space 12 and a first thread control block (TCB) 13 on the first kernel 1.
A shadow service process 21 resides on the second kernel 2, and the shadow service process 21 has a second user address space 22 and a second thread control block 23 on the second kernel 2.
The quick recovery of the failed service is performed by:
Step S1, when the service process 11 is created on the first kernel 1, a corresponding shadow service process 21 is created on the second kernel 2, and the first user address space 12 and the second user address space 22 are mapped to the same memory 3, that is, to the same physical address space.
Step S2, when the service process 11 interacts with the first kernel 1 and modifies the content of the first user address space 12, the shadow service process 21 also interacts with the second kernel 2 and modifies the content of the second user address space 22 in the same way, so that the two sides stay consistent in their memory data. Rights management and data management are implemented through seL4 capabilities. At this time, the second thread control block 23 is in the copy queue of the second kernel 2 and does not actually run; that is, the shadow service process 21 is in the copy queue and does not itself modify the content of user memory.
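The shared mapping of steps S1 and S2 can be illustrated with a minimal Python sketch. It is a toy model rather than seL4 code: the `AddressSpace` class, the virtual base addresses `0x1000` and `0x8000`, and the use of a `bytearray` to stand in for memory 3 are all assumptions made for illustration and do not come from the patent.

```python
class AddressSpace:
    """Toy user address space: translates virtual addresses into one backing frame."""

    def __init__(self, name, virt_base, frame):
        self.name = name
        self.virt_base = virt_base  # hypothetical virtual base address
        self.frame = frame          # bytearray standing in for physical memory 3

    def write(self, vaddr, data):
        offset = vaddr - self.virt_base
        self.frame[offset:offset + len(data)] = data

    def read(self, vaddr, length):
        offset = vaddr - self.virt_base
        return bytes(self.frame[offset:offset + length])


# Step S1: both address spaces are backed by the same memory.
memory_3 = bytearray(64)
first_space = AddressSpace("first user address space 12", 0x1000, memory_3)
second_space = AddressSpace("second user address space 22", 0x8000, memory_3)

# Step S2: a write through the first space is visible through the second
# without any copy, because both translate to the same bytes.
first_space.write(0x1000, b"counter=7")
print(second_space.read(0x8000, 9))
```

Because the two views share one backing buffer, consistency in this model costs no data transfer, which is the property the method relies on to avoid copying at failover time.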
FIG. 2 is a schematic diagram of multiple user processes using the service process according to an embodiment of the present invention.
As shown in fig. 2, a first user process 14 runs on a first kernel 1 and a second user process 24 runs on a second kernel 2. Both the first user process 14 and the second user process 24 access the service process 11 on the first kernel 1 through inter-process communication (IPC).
FIG. 3 is a diagram illustrating the operation of a shadow service process in case of failure according to an embodiment of the present invention.
Step S3, as shown in FIG. 3, when the first kernel 1 fails, the service process 11 fails with it. Once the system detects that the first kernel 1 has failed, the second kernel 2 moves the second thread control block 23 into the work queue of the second kernel 2; that is, the shadow service process 21 becomes a normally running task whose state information and data are consistent with those of the service process 11 before the failure. At this point the first user process 14 on the first kernel 1 has also failed, while the second user process 24 on the second kernel 2 can access the shadow service process 21 through exactly the same IPC call (e.g., using the same seL4 capability), so the failed service is quickly recovered for the second user process 24.
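Step S3 can be sketched as a small simulation. This is a simplified model under stated assumptions, not the seL4 scheduler: the `TCB` and `Kernel` classes, the `promote_shadow` and `ipc_call` helpers, the endpoint name `"svc"`, and the dictionary standing in for the shared memory 3 are all hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TCB:
    owner: str     # e.g. "service process 11"
    endpoint: str  # hypothetical IPC endpoint name shared by service and shadow
    memory: dict   # stands in for the shared memory 3


@dataclass
class Kernel:
    name: str
    alive: bool = True
    work_queue: deque = field(default_factory=deque)  # schedulable TCBs
    copy_queue: deque = field(default_factory=deque)  # parked shadow TCBs


def promote_shadow(kernel: Kernel, shadow: TCB) -> None:
    """Step S3: move the shadow TCB from the copy queue into the work queue."""
    kernel.copy_queue.remove(shadow)
    kernel.work_queue.append(shadow)


def ipc_call(kernels, endpoint: str, request: str) -> str:
    """Deliver a request to the first live kernel whose work queue serves the endpoint."""
    for kernel in kernels:
        if not kernel.alive:
            continue
        for tcb in kernel.work_queue:
            if tcb.endpoint == endpoint:
                tcb.memory["handled"] += 1
                return f"{request} served by {tcb.owner} on {kernel.name}"
    raise RuntimeError(f"no live kernel serves {endpoint!r}")


shared = {"handled": 0}
kernel_1 = Kernel("kernel 1")
kernel_2 = Kernel("kernel 2")
service = TCB("service process 11", "svc", shared)
shadow = TCB("shadow service process 21", "svc", shared)
kernel_1.work_queue.append(service)
kernel_2.copy_queue.append(shadow)

before = ipc_call([kernel_1, kernel_2], "svc", "request A")
kernel_1.alive = False             # the first kernel fails
promote_shadow(kernel_2, shadow)   # the second kernel reacts
after = ipc_call([kernel_1, kernel_2], "svc", "request B")
print(before)
print(after)
print(shared["handled"])
```

In this model the caller issues the same `ipc_call` before and after the failure; only the queue membership of the shadow TCB changes, and the request counter continues from the shared state rather than restarting.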
Action and Effect of the Embodiment
According to the fault tolerance method for a multi-kernel operating system of this embodiment, a shadow service process corresponding to the service process is prepared in advance, and the user address spaces of the two processes are mapped to the same memory. When the service process interacts with the kernel on which it runs and modifies the content of its user address space, the shadow service process modifies the content of its own user address space in the same way, so the state information and data of the two processes remain consistent. The shadow service process sits in the copy queue and does not actually run. When the kernel on which the service process runs fails, the shadow service process is placed in the work queue and becomes a normally running task, and the user process can access the shadow service process directly through the original inter-process communication. The method can therefore quickly replace the original service on the failed kernel with the shadow service. The mechanism is simple, occupies few system resources, requires no copying, and has little impact on performance.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
In the above embodiment, the multi-kernel operating system is seL4; in other embodiments of the present invention, other microkernel operating systems, such as Minix, may also be used to achieve the technical effects of the present invention.
In the above embodiment, one corresponding shadow service process is created for one service process; in other embodiments of the present invention, a plurality of corresponding shadow service processes may also be created for one service process on a plurality of other kernels, which likewise achieves the technical effects of the present invention.
In the above embodiment, a corresponding shadow service process is created for a single service process; in other embodiments of the present invention, corresponding shadow service processes may also be created for each of a plurality of service processes, which likewise achieves the technical effects of the present invention.
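For the variant in which one service process has shadow service processes on several other kernels, some rule is needed to decide which shadow to activate. The patent does not specify one; the sketch below assumes the simplest possible policy (take the first kernel in a fixed order that has not failed), and the function name `choose_shadow_kernel` and the integer kernel identifiers are hypothetical.

```python
def choose_shadow_kernel(shadow_kernels, failed_kernels):
    """Return the first kernel that holds a shadow and has not failed, or None."""
    for kernel_id in shadow_kernels:
        if kernel_id not in failed_kernels:
            return kernel_id
    return None


# Service on kernel 1, shadows parked in the copy queues of kernels 2, 3 and 4.
shadow_kernels = [2, 3, 4]
print(choose_shadow_kernel(shadow_kernels, failed_kernels={1}))
print(choose_shadow_kernel(shadow_kernels, failed_kernels={1, 2}))
```

A fixed order keeps the choice deterministic, so every surviving kernel would reach the same answer without extra coordination; other policies, such as picking the least loaded kernel, would fit the same interface.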

Claims (5)

1. A fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel of the multi-kernel operating system, the multi-kernel operating system comprising:
a first kernel running a service process, the service process having a first user address space on the first kernel;
a second kernel running at least a user process; and
a memory,
the fault tolerance method for a multi-kernel operating system being characterized by comprising the following steps:
Step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space on the second kernel, and the first user address space and the second user address space are mapped to the same memory;
Step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way;
Step S3, when the first kernel is detected to have failed, the second kernel moves the shadow service process into a work queue, so that the user process switches from accessing the service process to accessing the shadow service process.
2. The fault tolerance method for a multi-kernel operating system according to claim 1, characterized in that:
the multi-kernel operating system is a microkernel operating system.
3. The fault tolerance method for a multi-kernel operating system according to claim 2, characterized in that:
the service process further has a first thread control block;
the shadow service process further has a second thread control block;
the first thread control block is in the work queue of the first kernel;
the second thread control block is in a copy queue of the second kernel.
4. The fault tolerance method for a multi-kernel operating system according to claim 3, characterized in that:
when the first kernel fails, the second kernel moves the second thread control block into the work queue.
5. The fault tolerance method for a multi-kernel operating system according to claim 1, characterized in that:
the user process accesses the service process and the shadow service process through inter-process communication.
CN202110776463.7A, priority date 2021-07-09, filing date 2021-07-09: Fault-tolerant method for multi-kernel operating system. Status: Pending. Publication: CN115599609A (en)

Priority Applications (1)

Application Number: CN202110776463.7A
Publication: CN115599609A (en)
Priority Date: 2021-07-09
Filing Date: 2021-07-09
Title: Fault-tolerant method for multi-kernel operating system

Applications Claiming Priority (1)

Application Number: CN202110776463.7A
Publication: CN115599609A (en)
Priority Date: 2021-07-09
Filing Date: 2021-07-09
Title: Fault-tolerant method for multi-kernel operating system

Publications (1)

Publication Number: CN115599609A (en)
Publication Date: 2023-01-13

Family

ID=84841411

Family Applications (1)

Application Number: CN202110776463.7A
Title: Fault-tolerant method for multi-kernel operating system
Priority Date: 2021-07-09
Filing Date: 2021-07-09
Status: Pending
Publication: CN115599609A (en)

Country Status (1)

Country: CN
Link: CN115599609A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination