CN115599609A

CN115599609A - Fault-tolerant method for multi-kernel operating system

Info

Publication number: CN115599609A
Application number: CN202110776463.7A
Authority: CN
Inventors: 张为华; 蒋金虎; 林玉哲
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-07-09
Filing date: 2021-07-09
Publication date: 2023-01-13

Abstract

The invention provides a fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel. The multi-kernel operating system comprises: a first kernel running a service process, the service process having a first user address space; a second kernel running at least a user process; and a memory. The method comprises the following steps: step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space, and the first user address space and the second user address space are mapped to the same memory; step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way; and step S3, when the first kernel is detected to have failed, the shadow service process is moved into a work queue, and the user process switches from accessing the service process to accessing the shadow service process.

Description

Fault-tolerant method for multi-kernel operating system

Technical Field

The invention belongs to the technical field of computer operating systems, and particularly relates to a fault tolerance method for a multi-kernel operating system.

Background

An operating system is a core component between computer hardware and applications. Currently, the computing performance of a computer depends mainly on the performance of a single CPU core and the number of cores on a chip. Because single-core performance is limited by physical device manufacturing processes, multi-core systems are increasingly widely used, and the number of cores they contain keeps growing. To run operating systems efficiently on multi-core systems and use their resources flexibly, researchers have designed and implemented multi-kernel operating systems. Andrew Baumann et al. designed the multikernel model, which allows different cores to run different kernel nodes; programs running on the kernel nodes can interact through inter-process communication (see "The Multikernel: A new OS architecture for scalable multicore systems", SOSP '09).

In the design of a multi-kernel operating system, when one of the kernels fails or needs to be updated, a fault-tolerance mechanism is required so that programs originally running on that kernel can continue to provide services. Prior work has studied this to some extent. Wanli et al. invented a failure control method that introduces lightweight kernels and achieves fault tolerance by transferring system services from a failed kernel to other healthy kernels (see "Failure control method and apparatus of multi-kernel operating system", CN104657240A). However, in this method, when a heavyweight kernel fails, a new heavyweight kernel must be selected from the plurality of lightweight kernels, system services must be transferred, and the corresponding kernel state information must be updated in all kernels, so the mechanism is complex. In addition, migrating system services involves copying state information and data, which affects performance.

Disclosure of Invention

To solve the above problems, the invention provides a fault tolerance method for a multi-kernel operating system, which adopts the following technical solution:

The invention provides a fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel of the multi-kernel operating system. The multi-kernel operating system comprises: a first kernel running a service process, the service process having a first user address space; a second kernel running at least a user process; and a memory. The method comprises the following steps: step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space, and the first user address space and the second user address space are mapped to the same memory; step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way; and step S3, when the first kernel is detected to have failed, the second kernel moves the shadow service process into a work queue, so that the user process switches from accessing the service process to accessing the shadow service process.

The fault tolerance method for the multi-kernel operating system provided by the invention may further have the technical feature that the multi-kernel operating system is a microkernel operating system.

The fault tolerance method for the multi-kernel operating system provided by the invention may further have the technical feature that the service process also has a first thread control block and the shadow service process also has a second thread control block; the first thread control block is in a work queue of the first kernel, and the second thread control block is in a copy queue of the second kernel.

The fault tolerance method for the multi-kernel operating system provided by the invention may further have the technical feature that, when the first kernel fails, the second kernel moves the second thread control block into the work queue.

The fault tolerance method for the multi-kernel operating system provided by the invention may further have the technical feature that the user process accesses the service process and the shadow service process through inter-process communication.

Action and Effect of the Invention

According to the fault tolerance method for the multi-kernel operating system of the invention, a shadow service process corresponding to the service process is prepared in advance, and the user address spaces of the service process and the shadow service process are mapped to the same memory. When the service process interacts with the kernel on which it runs and modifies the content of its user address space, the shadow service process modifies the content of its own user address space in the same way, so that the state information and data of the two processes remain consistent. The shadow service process stays in a copy queue and does not actually run. When the kernel hosting the service process fails, the shadow service process is placed in a work queue and becomes a normally running task, and the user process can call the shadow service process directly through the original inter-process communication. The method can therefore quickly replace the original service running on the failed kernel with the shadow service. Its mechanism is simple, it occupies few system resources, it requires no copying, and it has little impact on performance.

Drawings

FIG. 1 is a schematic diagram of a service process and a shadow service process according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of multiple user processes using the service process according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating the operation of a shadow service process in case of failure according to an embodiment of the present invention.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the fault tolerance method for the multi-kernel operating system of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.

<Example>

In this embodiment, the multi-kernel operating system is seL4, running on a multi-core hardware platform based on the x86 architecture.

FIG. 1 is a schematic diagram of the service process and shadow service process according to an embodiment of the present invention.

As shown in fig. 1, the multi-kernel operating system includes a first kernel 1, a second kernel 2, and a memory 3. Any two kernels in the multi-kernel operating system can serve as the first kernel 1 and the second kernel 2.

A service process 11 runs on the first kernel 1, and the service process 11 has a first user address space 12 and a first Thread Control Block (TCB) 13 on the first kernel 1.

A shadow service process 21 resides on the second kernel 2, and the shadow service process 21 has a second user address space 22 and a second thread control block 23 on the second kernel 2.

Quick recovery of a failed service is performed by the following steps:

Step S1, when the service process 11 is created on the first kernel 1, a corresponding shadow service process 21 is created on the second kernel 2, and the first user address space 12 and the second user address space 22 are mapped to the same memory 3, that is, to the same physical address space.
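The shared mapping in step S1 can be illustrated with a toy model. The sketch below is not seL4 code; `PhysicalMemory`, `AddressSpace`, and `create_service_with_shadow` are hypothetical names introduced only for illustration, and the page size is an assumption. It shows that when two address spaces map the same virtual pages to the same physical frames, a write through the first is visible through the second without any copy.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class PhysicalMemory:
    """Toy stand-in for memory 3: a flat array of page frames."""

    def __init__(self, num_frames):
        self.data = bytearray(num_frames * PAGE_SIZE)


class AddressSpace:
    """Toy user address space: maps virtual page numbers to physical frames."""

    def __init__(self, memory):
        self.memory = memory
        self.page_table = {}

    def map_page(self, vpage, frame):
        self.page_table[vpage] = frame

    def _translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        return self.page_table[vpage] * PAGE_SIZE + offset

    def write(self, vaddr, payload):
        for i, byte in enumerate(payload):
            self.memory.data[self._translate(vaddr + i)] = byte

    def read(self, vaddr, length):
        return bytes(
            self.memory.data[self._translate(vaddr + i)] for i in range(length)
        )


def create_service_with_shadow(memory, frames):
    """Hypothetical step S1: build both address spaces over the same frames."""
    first_space = AddressSpace(memory)
    second_space = AddressSpace(memory)
    for vpage, frame in enumerate(frames):
        first_space.map_page(vpage, frame)
        second_space.map_page(vpage, frame)
    return first_space, second_space


memory = PhysicalMemory(num_frames=8)
first_space, second_space = create_service_with_shadow(memory, frames=[5, 2])

# The service writes through its own address space (crossing a page boundary).
first_space.write(4090, b"state-v1")

# The shadow sees the same bytes through its own mapping, with no copy made.
shadow_view = second_space.read(4090, 8)
print(shadow_view)
```

In this model the two page tables are identical by construction, which is a simplification; the source only requires that both address spaces map to the same memory.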

Step S2, when the service process 11 interacts with the first kernel 1 and modifies the content in the first user address space 12, the shadow service process 21 also interacts with the second kernel 2 and modifies the content in the second user address space 22 in the same way, so that the memory data on both sides remain consistent. Rights management and data management are implemented through the capabilities of seL4. At this time, the second thread control block 23 is in the copy queue of the second kernel 2 and does not actually run; that is, the shadow service process 21 stays in the copy queue and does not itself modify the content of user memory.
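The distinction between the work queue and the copy queue can be sketched as follows. This is a simplified model rather than the seL4 scheduler; `TCB`, `Kernel`, and `schedule_once` are hypothetical names, and round-robin scheduling is an assumption made only to keep the example small. The point it illustrates is that a thread control block parked in the copy queue is never selected to run.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TCB:
    """Toy thread control block that records how often it was scheduled."""
    name: str
    run_count: int = 0


class Kernel:
    """Toy kernel with a work queue (runnable) and a copy queue (parked shadows)."""

    def __init__(self, name):
        self.name = name
        self.work_queue = deque()
        self.copy_queue = deque()

    def schedule_once(self):
        """Round-robin over the work queue only; the copy queue is never consulted."""
        if not self.work_queue:
            return None
        tcb = self.work_queue.popleft()
        tcb.run_count += 1
        self.work_queue.append(tcb)
        return tcb


kernel2 = Kernel("kernel2")
user_tcb = TCB("user-process-24")
shadow_tcb = TCB("shadow-service-21")

kernel2.work_queue.append(user_tcb)     # ordinary runnable task
kernel2.copy_queue.append(shadow_tcb)   # shadow is present but parked

for _ in range(5):
    kernel2.schedule_once()

print(user_tcb.run_count, shadow_tcb.run_count)
```

Keeping the shadow's thread control block fully constructed but outside the work queue is what lets the later switch-over be a queue move instead of a process creation.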

FIG. 2 is a schematic diagram of multiple user processes using the service process according to an embodiment of the present invention.

As shown in fig. 2, a first user process 14 runs on a first kernel 1 and a second user process 24 runs on a second kernel 2. Both the first user process 14 and the second user process 24 access the service process 11 on the first kernel 1 through inter-process communication (IPC).

Step S3, as shown in fig. 3: when the first kernel 1 fails, the service process 11 fails with it. Once the system detects that the first kernel 1 has failed, the second kernel 2 moves the second thread control block 23 into the work queue of the second kernel 2; that is, the shadow service process 21 becomes a normally running task whose state information and data are consistent with those of the service process 11 before the failure. At this point, the first user process 14 on the first kernel 1 has also failed, while the second user process 24 on the second kernel 2 can access the shadow service process 21 through exactly the same IPC call (e.g., using the same seL4 permissions), so the failed service is quickly recovered for the second user process 24.
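The switch-over in step S3 can be sketched in the same toy style. Nothing here is the seL4 API: `Kernel`, `handle_kernel_failure`, and `ipc_call` are hypothetical names, the endpoint is modeled as a plain list, and a dictionary stands in for the shared memory that both processes map. The sketch assumes failure detection has already happened and only shows the queue move and the unchanged call path seen by the user process.

```python
from collections import deque


class Kernel:
    """Toy kernel with a liveness flag, a work queue, and a copy queue."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.work_queue = deque()
        self.copy_queue = deque()


def handle_kernel_failure(failed, survivor, shadow_tcb):
    """Hypothetical step S3: promote the shadow from copy queue to work queue."""
    failed.alive = False
    failed.work_queue.clear()
    survivor.copy_queue.remove(shadow_tcb)
    survivor.work_queue.append(shadow_tcb)


def ipc_call(endpoint, shared_state, request):
    """The caller uses one endpoint; it resolves to whichever server is runnable."""
    for kernel, tcb in endpoint:
        if kernel.alive and tcb in kernel.work_queue:
            shared_state["handled"] += 1  # state lives in the shared memory
            return f"{tcb}@{kernel.name} handled {request} (#{shared_state['handled']})"
    raise RuntimeError("no runnable service behind this endpoint")


kernel1, kernel2 = Kernel("kernel1"), Kernel("kernel2")
kernel1.work_queue.append("service")
kernel2.work_queue.append("user")
kernel2.copy_queue.append("shadow")

endpoint = [(kernel1, "service"), (kernel2, "shadow")]
shared_state = {"handled": 0}  # stands in for memory mapped by both processes

before = ipc_call(endpoint, shared_state, "get")
handle_kernel_failure(kernel1, kernel2, "shadow")
after = ipc_call(endpoint, shared_state, "get")

print(before)
print(after)
```

The request counter continues from 1 to 2 across the failure because the state was never copied; the shadow simply starts serving from the memory it already shares with the failed service.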

Action and Effect of the Embodiment

According to the fault tolerance method for the multi-kernel operating system of this embodiment, a shadow service process corresponding to the service process is prepared in advance, and the user address spaces of the service process and the shadow service process are mapped to the same memory. When the service process interacts with the kernel on which it runs and modifies the content of its user address space, the shadow service process also modifies the content of its own user address space in the same way, so that the state information and data of the two processes remain consistent. The shadow service process stays in the copy queue and does not actually run. When the kernel hosting the service process fails, the shadow service process is placed in the work queue and becomes a normally running task, and the user process can access the shadow service process directly through the original inter-process communication. The method can therefore quickly replace the original service running on the failed kernel with the shadow service. Its mechanism is simple, it occupies few system resources, it requires no copying, and it has little impact on performance.

The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.

In the above embodiment, the multi-kernel operating system is seL4. In other embodiments of the present invention, other microkernel operating systems, such as Minix, may also be used to achieve the technical effects of the present invention.

In the above embodiment, one corresponding shadow service process is created for one service process. In other embodiments of the present invention, a plurality of corresponding shadow service processes may also be created for one service process on a plurality of other kernels, and the technical effects of the present invention can likewise be achieved.
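When several shadow service processes exist on different kernels, one of them has to be chosen for promotion after a failure. The source does not specify a selection policy; the sketch below assumes the simplest one (the first surviving kernel in a fixed order), and `promote_shadow` and the kernel names are hypothetical.

```python
def promote_shadow(shadow_kernels, alive):
    """Return the first kernel hosting a shadow that is still alive, or None.

    The first-in-order policy is an assumption; the source leaves it open.
    """
    for kernel in shadow_kernels:
        if alive.get(kernel, False):
            return kernel
    return None


# Shadows of one service process are hosted on three other kernels.
shadow_kernels = ["kernel2", "kernel3", "kernel4"]

# kernel1 (hosting the service) and kernel2 have both failed.
alive = {"kernel1": False, "kernel2": False, "kernel3": True, "kernel4": True}

chosen = promote_shadow(shadow_kernels, alive)
print(chosen)
```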

In the above embodiment, one corresponding shadow service process is created for one service process. In other embodiments of the present invention, corresponding shadow service processes may also be created for each of a plurality of service processes, and the technical effects of the present invention can likewise be achieved.

Claims

1. A fault tolerance method for a multi-kernel operating system, used to rapidly recover services running on a failed kernel of the multi-kernel operating system, the multi-kernel operating system comprising:

a first kernel running a service process, the service process having a first user address space on the first kernel;

a second kernel running at least a user process; and

a memory,

the fault tolerance method for the multi-kernel operating system is characterized by comprising the following steps of:

step S1, when the service process is created on the first kernel, a corresponding shadow service process is created on the second kernel, the shadow service process has a second user address space on the second kernel, and the first user address space and the second user address space are mapped to the same memory;

step S2, when the service process modifies the content of the first user address space, the shadow service process modifies the content of the second user address space in the same way;

and step S3, when the first kernel is detected to have failed, the second kernel moves the shadow service process into a work queue, so that the user process switches from accessing the service process to accessing the shadow service process.

2. The fault tolerance method for a multi-kernel operating system according to claim 1, characterized in that:

wherein, the multi-kernel operating system is a microkernel operating system.

3. The fault tolerance method for a multi-kernel operating system according to claim 2, characterized in that:

wherein the service process further has a first thread control block;

the shadow service process further has a second thread control block;

the first thread control block is in the work queue of the first kernel;

the second thread control block is in a copy queue of the second kernel.

4. The fault tolerance method for a multi-kernel operating system according to claim 3, characterized in that:

wherein the second kernel moves the second thread control block into the work queue when the first kernel fails.

5. The fault tolerance method for a multi-kernel operating system according to claim 1, characterized in that:

wherein the user process accesses the service process and the shadow service process through inter-process communication.