CN116700901A - Container construction and operation system and method based on microkernel - Google Patents


Info

Publication number
CN116700901A
CN116700901A (Application CN202310746321.5A)
Authority
CN
China
Prior art keywords: container, resources, request, mounting point, memory
Legal status
Pending
Application number
CN202310746321.5A
Other languages
Chinese (zh)
Inventor
糜泽羽
岑少锋
陈海波
臧斌宇
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202310746321.5A
Publication of CN116700901A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45562 Creating, deleting, cloning virtual machine instances
    • G06F 2009/45583 Memory management, e.g. access or allocation
    • G06F 2009/45587 Isolation or security of virtual machine instances
    • G06F 2009/45595 Network integration; Enabling network access in virtual machine instances
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a microkernel-based container construction and operation system and method, comprising a namespace function module, a control group function module, and a fault recovery function module. The namespace function module partitions static system resources: it divides mount point data, network protocol stack data, and process management data, so that application containers placed in different namespaces access different system resources, achieving isolation of system resources. The control group function module partitions dynamic system resources: it accounts for and limits CPU resources, memory resources, and I/O bandwidth resources, so that application containers placed in different control groups use only their allotted resources, achieving limitation of system resources. The fault recovery function module handles system service crashes caused by memory errors. The invention achieves stronger isolation and security while also improving performance.

Description

Container construction and operation system and method based on microkernel
Technical Field
The invention relates to the technical field of virtualization, in particular to a container construction and operation system and method based on microkernels.
Background
Containers provide isolated environments for running operating systems and applications, and their use has become increasingly common in recent years; accordingly, the importance of container security has also grown. Isolation is a key aspect of that security: in a highly isolated design, the failure of one container does not affect other containers or the operating system, and isolation is critical to container reliability and scalability. However, as operating systems continue to develop, the isolation mechanisms of conventional containers face new problems and challenges. For example, the code size of operating systems and software keeps expanding, so applications inevitably contain many undiscovered vulnerabilities that malicious containers can easily exploit: a faulty or compromised container may steal the data of other containers running on the same kernel, or occupy resources that belong to them. Researchers in industry and academia have proposed a series of technical solutions to strengthen container isolation, including new software architectures with isolation planes, hardening of traditional operating systems, new isolation hardware, and strongly isolated operating systems.
When applications run on a traditional Linux operating system, process isolation is weak: resources such as the network, CPU, memory, and disk are shared among processes, so a malicious process can easily attack the data flows of other processes and steal their data. A process with high security requirements therefore needs an additional protection mechanism to isolate itself from other processes. Linux provides the Namespace and Control Groups mechanisms to help processes achieve such isolation. The namespace mechanism provides resource isolation: each namespace has its own view of system resources such as processes, inter-process communication, the network, and files, and these are invisible to the remaining namespaces. Placing different processes into different namespaces isolates their resources, so a malicious process cannot directly steal another process's data. The control group mechanism limits the amount of network, CPU, memory, disk, and other resources each process may use, preventing any one process from consuming so many system resources that it degrades the performance of others.
Containers are built on the namespace and control group mechanisms and provide an isolated execution environment for the processes running inside them. LXC, short for Linux Container, a container technology natively supported by Linux, is a typical example. LXC treats each namespace as one container, and each container has its own view of processes, inter-process communication, the file system, and the network. Because namespaces provide the isolation, different containers cannot access each other's data and cannot affect the control flows executing in other containers. Its architecture is shown in fig. 14. In LXC, each container runs an operating system environment and the corresponding applications. To prevent one container from occupying excessive system resources and degrading the performance of other containers and the kernel, LXC uses the control group mechanism to cap the system resources available to the process group running in each namespace, i.e., to limit the amount of network, CPU, memory, disk, and other resources each container may occupy.
Although containers enhance their isolation using the namespace and control group mechanisms, two major isolation issues remain: security isolation and performance isolation.
Security isolation refers to whether one container can access the data of other containers or compromise their security. Containers are more secure than ordinary processes, but security isolation problems remain. Although containers are isolated by the namespace and control group mechanisms, they are still processes running on the same operating system and must trust the entire operating system to function properly: the operating system's code is their Trusted Code Base (TCB). That code base has expanded rapidly; the Linux kernel grew from about 150,000 lines of code in v1.0 to more than 20 million lines in v4.15, a more than 100-fold expansion of the TCB, so Linux inevitably contains a large number of latent vulnerabilities. If a container triggers an operating system vulnerability, it may access or attack other containers on the same kernel, or even the operating system itself. For example, the published Shocker attack on the Docker platform exploited a design flaw in a file-handle-related system call to let a container access any file on the operating system, including files of other containers, breaking the container's security isolation.
Performance isolation refers to whether one container can affect the performance of another. In LXC and Docker, a container can by default occupy all CPU resources. If the deployer does not manually set an upper limit on the CPU resources each container may use, running a CPU-intensive program in one container degrades the performance of the others; when the CPU is heavily occupied, the transaction throughput of a database running in an LXC container can drop by about 10%. In addition, because a large number of vulnerabilities lurk in the operating system, a container can exploit them not only to attack the security of other containers or the operating system, but also to make the operating system allocate it extra resources, encroaching on resources that belong to other containers or the operating system and hurting their performance. Measurements indicate that when disk I/O resources are saturated, the efficiency of a database running in LXC falls by about 15%, and when memory resources are saturated, by roughly 30%.
It can be seen that conventional containers, although they strengthen process isolation, still suffer from both security isolation and performance isolation problems, leaving considerable room for improvement.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a container construction and operation system and method based on microkernel.
According to the container construction and operation system and method based on microkernel provided by the invention, the scheme is as follows:
in a first aspect, a microkernel-based container build and run system is provided, the system comprising: a namespace function module, a control group function module, and a fault recovery function module;
wherein the namespace function module is used for partitioning static system resources: it divides mount point data, network protocol stack data, and process management data, so that application containers placed in different namespaces access different system resources, achieving isolation of system resources;
the control group function module is used for partitioning dynamic system resources: it accounts for and limits CPU resources, memory resources, and I/O bandwidth resources, so that application containers placed in different control groups use only their allotted resources, achieving limitation of system resources;
the fault recovery function module is used for handling system service crashes caused by memory errors;
the system also includes a system container to enhance resource statistics and failure recovery for the system service processes.
Preferably, the namespace function module includes: creating, joining, exiting, destroying, and using a file system mount point namespace;
The file system mount point namespaces are created as follows:
step 1): initiating a creation request, wherein the container management program sends a request to the system container by inter-process communication to create a namespace for a file system mount point;
step 2): acquiring a working path sent by a container management program, and judging whether the path length is smaller than 255 characters;
step 3): acquiring mounting point information of a working path in an original state, and acquiring mounting point information from the mounting point information in the original state according to the working path;
step 4): sending a request for creating a namespace to a corresponding file system, wherein a specific flow is introduced in the creation of the namespace of the file system;
step 5): creating a new mounting point information linked list;
step 6): initializing a new mounting point information linked list, and adding the mounting point information acquired in the step 3) into the linked list as mounting point information of a path;
step 7): acquiring available namespaces ID from the array;
step 8): adding a new mounting point information linked list into an array element of a system container;
step 9): the inter-process communication returns to complete the creation of the new namespace.
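As a rough illustration, the user-state side of this creation flow (steps 2 through 8) can be sketched in C. All identifiers here (`mount_ns_table`, `mount_info`, `MAX_NS`) are hypothetical, not taken from the patent; the file-system request of step 4) and the inter-process communication of steps 1) and 9) are omitted.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NS   64    /* illustrative capacity of the namespace array */
#define MAX_PATH 255   /* path length limit from step 2) of the flow   */

/* One node of the mount-point information linked list (steps 5 and 6). */
struct mount_info {
    char path[MAX_PATH];
    struct mount_info *next;
};

/* Array element of the system container: one list head per namespace ID.
 * Slot 0 is reserved for the root namespace. */
static struct mount_info *mount_ns_table[MAX_NS];

/* Validate the working path, clone its mount info from the root namespace,
 * pick a free ID, and install the new list. Returns the new ID or -1. */
int mount_ns_create(const char *work_path)
{
    if (work_path == NULL || strlen(work_path) >= MAX_PATH)
        return -1;                                 /* step 2): path too long */

    struct mount_info *root = mount_ns_table[0];   /* step 3): original state */
    struct mount_info *head = malloc(sizeof(*head));   /* step 5) */
    if (head == NULL)
        return -1;
    /* step 6): initialise the new list with the path's mount info */
    strncpy(head->path, root ? root->path : work_path, MAX_PATH - 1);
    head->path[MAX_PATH - 1] = '\0';
    head->next = NULL;

    for (int id = 1; id < MAX_NS; id++) {          /* step 7): free ID  */
        if (mount_ns_table[id] == NULL) {
            mount_ns_table[id] = head;             /* step 8): install  */
            return id;
        }
    }
    free(head);
    return -1;                                     /* table full */
}
```

The first created namespace receives ID 1 because ID 0 denotes the root namespace, matching the destroy flow's refusal to destroy ID 0.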
Preferably, the step of joining the file system mount point namespace is as follows:
Step 1): initiating a request for joining a name space, wherein a container management program needs to transmit an ID of a name space of a system container of a mounting point and a system service type into a kernel;
step 2): acquiring a system container name space ID and a system service type of a mounting point system, which are transmitted by a container management program;
step 3): filling the name space ID of the mount point system container, which is transmitted by the container management program, into the position of the name space of the mount point system container of the application container process;
step 4): and finishing the namespace joining request, and returning to the user mode from the kernel mode.
Preferably, the step of exiting the file system mount point namespace is as follows:
step 1): initiating a request for exiting a namespace, wherein a container management program needs to transmit the system service type of the system container of the mounting point into a kernel;
step 2): acquiring a system service type transmitted by a container management program, namely a system service type of a system container of a mounting point system;
step 3): clearing up the namespaces corresponding to the system containers of the mounting points in the kernel;
step 4): and finishing the namespace exit request, and returning to the user mode from the kernel mode.
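The kernel-side bookkeeping behind the join and exit flows above can be modeled as filling or clearing a per-process slot indexed by system service type. The names below (`proc_ns`, `SVC_MOUNT`, and so on) are illustrative assumptions; ID 0 stands for the root namespace.

```c
#include <stddef.h>

/* Illustrative system-service types that own namespace state. */
enum sys_service { SVC_MOUNT = 0, SVC_NET = 1, SVC_PROC = 2, SVC_COUNT };

/* Kernel-side per-process record: one namespace ID slot per service type. */
struct proc_ns {
    int ns_id[SVC_COUNT];
};

/* Join (steps 1-4 of the join flow): fill the ID passed in by the
 * container management program into the process's slot for that type. */
int ns_join(struct proc_ns *p, enum sys_service type, int ns_id)
{
    if (p == NULL || type >= SVC_COUNT || ns_id < 0)
        return -1;
    p->ns_id[type] = ns_id;
    return 0;
}

/* Exit (steps 1-4 of the exit flow): clear the slot back to the root. */
int ns_exit(struct proc_ns *p, enum sys_service type)
{
    if (p == NULL || type >= SVC_COUNT)
        return -1;
    p->ns_id[type] = 0;
    return 0;
}
```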
Preferably, the step of destroying the file system mount point namespaces is as follows:
Step 1): initiating a destruction request, wherein the container management program sends a request to the system container by inter-process communication to destroy a file system mount point namespace;
step 2): acquiring a name space ID sent by a container management program;
step 3): judging whether the namespace ID is 0; if so, it denotes the root namespace and the destroy fails;
step 4): judging whether a namespace corresponding to the namespace ID exists; if it does not exist, the destroy fails;
step 5): acquiring mounting point information of a path;
step 6): sending a request for destroying the name space to a corresponding file system;
step 7): clearing the current mount point information linked list, including releasing its memory and setting the pointer to 0;
step 8): clearing corresponding name space elements in the array of the system container;
step 9): and returning the inter-process communication request to complete the destruction of the naming space.
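A minimal sketch of the destroy flow's core checks and cleanup (steps 3, 4, 7, and 8), under the same hypothetical `mount_ns_table` layout as before; the file-system request and the inter-process communication return are omitted.

```c
#include <stdlib.h>

#define MAX_NS 64

struct mount_info { struct mount_info *next; };
static struct mount_info *mount_ns_table[MAX_NS];

/* Reject the root namespace, verify the ID is live, free the whole
 * mount-info list, and clear the array slot in the system container. */
int mount_ns_destroy(int ns_id)
{
    if (ns_id <= 0 || ns_id >= MAX_NS)
        return -1;                        /* step 3): ID 0 is the root  */
    struct mount_info *node = mount_ns_table[ns_id];
    if (node == NULL)
        return -1;                        /* step 4): no such namespace */
    while (node != NULL) {                /* step 7): release the list  */
        struct mount_info *next = node->next;
        free(node);
        node = next;
    }
    mount_ns_table[ns_id] = NULL;         /* step 8): clear the slot    */
    return 0;
}
```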
Preferably, the step of using the file system mount point namespace is as follows:
step 1): the application container initiates a request and sends an inter-process communication request to the mounting point system service;
step 2): acquiring a name space ID of a current application container in a kernel mode;
step 3): acquiring a corresponding system container from a structure body related to the inter-process communication request;
Step 4): switching the process to a system container process and transmitting a namespace ID;
step 5): acquiring a name space ID transferred by a kernel;
step 6): searching the mounting point linked list information of the corresponding name space from the array according to the name space ID and switching;
step 7): executing a specific application container request;
step 8): after the request is processed, returning to a kernel mode;
step 9): and finishing the request and returning to the application container.
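On the system-container side, the use flow (steps 5 through 7) reduces to looking up the mount list for the namespace ID that the kernel attached to the inter-process communication request. The sketch below is illustrative; `serve_request` and its return convention are assumptions, not the patent's interface.

```c
#include <stddef.h>

#define MAX_NS 64

/* Per-namespace mount list head, as installed at creation time. */
struct mount_info { const char *path; struct mount_info *next; };
static struct mount_info *mount_ns_table[MAX_NS];

/* Steps 5-7 in the system container: the kernel has passed the caller's
 * namespace ID along with the request; switch to that namespace's mount
 * list and resolve the request against it. Returns the mount path the
 * request is served from, or NULL on error. */
const char *serve_request(int ns_id, const char *rel_path)
{
    if (ns_id < 0 || ns_id >= MAX_NS || rel_path == NULL)
        return NULL;
    struct mount_info *m = mount_ns_table[ns_id];  /* step 6): switch  */
    if (m == NULL)
        return NULL;
    return m->path;                                /* step 7): execute */
}
```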
Preferably, the control group function module includes:
statistics and limitation of CPU resources: the total number of clock interrupts received by a process is taken as its CPU resource usage; when a clock interrupt fires, the kernel obtains the process currently running on the CPU that triggered the interrupt and increments that process's clock interrupt count by 1;
the scheduling policy is modified: when the current process's time slice reaches 0, the scheduler takes a process from the wait queue, computes its actual CPU utilization, i.e., the ratio of the clock interrupts it has received to the sum of clock interrupts received by all processes in the wait queue, and compares that actual utilization with the utilization set by the user;
if the process's actual CPU utilization is too high, its time slice is shrunk, possibly leaving the process in the waiting state for a long time; if its utilization is too low, its time slice is enlarged;
by controlling each process's time slice independently through the modified scheduling policy, the CPU utilization of every process matches the user-set value over each complete scheduling round;
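One plausible reading of this modified scheduling decision is sketched below: the actual CPU share is the process's clock-interrupt count over the queue total, and the next time slice shrinks or grows with the distance from the user-set share. `BASE_SLICE` and the linear scaling rule are illustrative choices, not specified by the patent.

```c
#define BASE_SLICE 10   /* ticks granted to a process exactly on target */

struct proc {
    unsigned long ticks;   /* clock interrupts received so far       */
    double target_share;   /* user-configured CPU share, 0.0 to 1.0  */
};

/* Returns the time slice for `p`, where `total_ticks` is the sum of
 * clock interrupts over all processes in the wait queue. A process over
 * its target share gets a shrunken slice (possibly 0, i.e. it keeps
 * waiting); a process under its share gets an enlarged one. */
int next_time_slice(const struct proc *p, unsigned long total_ticks)
{
    if (total_ticks == 0)
        return BASE_SLICE;                        /* no history yet */
    double actual = (double)p->ticks / (double)total_ticks;
    if (actual > p->target_share) {
        double excess = actual - p->target_share;
        int slice = BASE_SLICE - (int)(excess * BASE_SLICE * 2);
        return slice > 0 ? slice : 0;             /* may wait a long time */
    }
    double deficit = p->target_share - actual;
    return BASE_SLICE + (int)(deficit * BASE_SLICE * 2);
}
```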
statistics and limitation of memory resources: page fault exceptions are captured, and the size of each allocated physical page is added to the usage of the corresponding application container process and application container, thereby accounting for physical-page memory usage;
if an application container's physical memory usage exceeds the value set by the user, the offending application container process is killed or its execution is suspended;
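A minimal model of this page-fault-driven accounting, with hypothetical names; the kill-or-suspend decision is reduced to a flag.

```c
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Illustrative per-container memory accounting: on each page fault the
 * kernel charges the newly mapped physical page to the container and
 * checks the total against the user-set limit. */
struct app_container {
    unsigned long mem_used;    /* bytes of physical memory charged */
    unsigned long mem_limit;   /* user-configured ceiling in bytes */
    bool killed;               /* set once the limit is exceeded   */
};

/* Called from the page-fault handler after a physical page is mapped.
 * Returns false (and marks the container) once the limit is exceeded;
 * a real system would kill or suspend the offending process here. */
bool charge_page(struct app_container *c)
{
    c->mem_used += PAGE_SIZE;
    if (c->mem_used > c->mem_limit) {
        c->killed = true;      /* stand-in for kill / suspend */
        return false;
    }
    return true;
}
```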
statistics and limitation of I/O bandwidth resources: a limiter system service is added between the file system service and the device driver service; every I/O request is captured by the limiter, which judges from the request's type and size whether it may be issued; if not, the request is suspended, and if so, the request is issued and the token count in the limiter system service is updated;
the token count grows gradually as the system runs, up to an upper limit; once the limit is reached, the count stops increasing.
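The limiter described above behaves like a token bucket; a sketch under assumed units (tokens counted in bytes, replenished once per clock tick) follows.

```c
#include <stdbool.h>

/* Token-bucket sketch of the limiter system service sitting between the
 * file system service and the device driver. Tokens accrue as the
 * system runs, capped at `cap`; an I/O request of `size` bytes may be
 * issued only if enough tokens are available. Names are illustrative. */
struct limiter {
    unsigned long tokens;   /* bytes of I/O currently allowed    */
    unsigned long cap;      /* upper limit on accumulated tokens */
    unsigned long rate;     /* tokens added per clock tick       */
};

/* Called on every clock tick: replenish, but never past the cap. */
void limiter_tick(struct limiter *l)
{
    l->tokens += l->rate;
    if (l->tokens > l->cap)
        l->tokens = l->cap;
}

/* Called for every captured I/O request: issue it and deduct tokens if
 * the budget suffices, otherwise tell the caller to suspend it. */
bool limiter_try_issue(struct limiter *l, unsigned long size)
{
    if (size > l->tokens)
        return false;       /* suspend: not enough budget yet */
    l->tokens -= size;
    return true;
}
```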
Preferably, the fault recovery function module includes: when a fault occurs, most commonly a page fault exception, the fault is captured, the faulting system container process is exited, and all of its resources are reclaimed;
a message requesting a restart of the system container is then sent to the process management system container, which restarts the system container immediately upon receipt and rebuilds its contents and inter-process communication.
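The recovery sequence can be modeled as a small state transition on the faulting system container; the states and fields below are illustrative, and the message to the process management container is reduced to comments.

```c
#include <stdbool.h>

/* Minimal model of the recovery flow: the faulting system container is
 * torn down (exit plus resource reclaim), the process management
 * container is messaged, and the service restarts with its IPC rebuilt. */
enum svc_state { SVC_RUNNING, SVC_FAULTED, SVC_RESTARTING };

struct sys_container {
    enum svc_state state;
    bool resources_held;   /* true while the crashed instance owns resources  */
    bool ipc_ready;        /* true once inter-process communication is rebuilt */
};

/* Invoked when a page fault (or similar memory error) is caught inside
 * a system service: reclaim everything, then rebuild and restart. */
void recover(struct sys_container *c)
{
    c->state = SVC_FAULTED;      /* fault captured                          */
    c->resources_held = false;   /* exit the faulting process, reclaim all  */
    c->ipc_ready = false;
    c->state = SVC_RESTARTING;   /* message sent to the process manager     */
    c->ipc_ready = true;         /* contents and IPC channels rebuilt       */
    c->state = SVC_RUNNING;      /* service restarted immediately           */
}
```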
Preferably, the user-state system service processes of the system containers are themselves managed as containers: their CPU and memory overhead is accounted for and they are managed uniformly;
the system accelerates microkernel I/O through direct memory access, using capabilities to manage the data to be transferred; data is copied directly from the device into memory, or written directly from memory to the device, via direct memory access, reducing the number of inter-process communications and redundant memory copies.
In a second aspect, a microkernel-based container construction and operation method is provided, the method comprising:
namespaces function steps: dividing static system resources; dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources;
Control group function steps: dividing dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources, using limited resources by different control groups where different application containers are located, and realizing limitation of the system resources;
fault recovery function step: and handling the situation of system service breakdown caused by memory errors.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention proposes the concept of a system container: not only user programs but all system services can be run in containers, ensuring more accurate statistics on system resource usage;
2. the invention uses direct memory access to improve microkernel performance, skipping redundant memory copies and inter-process communication;
3. the invention provides flexible container support in a microkernel environment: it can be ported flexibly to any microkernel platform without requiring a specific macrokernel environment, can support real-time requirements within a container through flexible interrupt isolation and scheduling-algorithm independence, and its system containers make resource statistics more accurate, manage system service behavior independently, and achieve stronger resource isolation;
4. compared with the prior art, the invention achieves a performance improvement on top of stronger isolation and security.
Other advantages of the invention will be set forth in the description of specific technical features and solutions, from which those skilled in the art will understand the benefits those features and solutions bring.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of the present invention;
FIG. 2 is a diagram of a namespace design architecture in accordance with the present invention;
FIG. 3 is a schematic diagram of a creation process of a file system mount point namespace in the present invention;
FIG. 4 is a schematic diagram of a file system mount point namespace joining process according to the present invention;
FIG. 5 is a diagram illustrating an exit procedure of a file system mount point namespace in accordance with the present invention;
FIG. 6 is a schematic diagram of a destroying procedure of a file system mount point namespace in the present invention;
FIG. 7 is a diagram illustrating a usage flow of a file system mount point namespace in accordance with the present invention;
FIG. 8 is a schematic diagram of a control group design architecture according to the present invention;
FIG. 9 is a schematic diagram of a fault recovery process flow in the present invention;
FIG. 10 is a flow chart of the direct memory access I/O of the present invention;
FIG. 11 is a diagram of CPU resource statistics;
FIG. 12 is a memory control group processing flow;
FIG. 13 is a flow of I/O control group processing;
fig. 14 is an LXC architecture.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The embodiment of the invention provides a container constructing and operating system based on microkernels, which is shown by referring to fig. 1 and comprises a name space function module, a control group function module and a fault recovery function module.
The invention also innovatively provides a system container to strengthen resource statistics and fault recovery of the system service process, and provides a mode of accelerating microkernel I/O by direct memory access.
1. Namespace function module: referring to FIG. 2, for partitioning static system resources; the method comprises the steps of dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources.
The namespaces are used in a manner that mainly includes creation, joining, exiting, destroying, and use.
Creation of namespaces is done in user state, including retrieval of user requests, creation and initialization of system resources, and the like. Different namespaces perform different system resource creation and initialization flows.
For the file system mount point namespaces, the working paths input by users, creation and initialization of mount point information linked lists and the like need to be acquired.
For the LwIP network protocol stack namespaces, a new contiguous array needs to be created to store network data, and a new linked list needs to be created to store the network interfaces.
For a process management namespace, process nodes in the process tree need to be selected as new root nodes as accessed root nodes.
The joining of a namespace is completed in the kernel: the container management program passes the corresponding namespace ID and the type of the corresponding system container into the kernel, which fills the data into the appropriate slot, completing the join.
The exit from a namespace is likewise completed in the kernel: the container management program passes the corresponding system container type into the kernel, which clears the data in the corresponding slot, completing the exit.
The destruction of namespaces is done in user state, and system resources created when namespaces are created need to be destroyed.
The use of namespaces depends on inter-process communication, when the inter-process communication needs to be sent, the ID of the corresponding namespaces is acquired in a kernel mode, then the ID is transferred to an upper system container, and the corresponding system resources are selected by the system container to be used.
Specifically, referring to FIG. 3, the steps for creating a file system mount point namespace are as follows:
step 1): initiating a creation request, wherein the container management program sends a request to the system container by inter-process communication to create a namespace for a file system mount point;
step 2): acquiring a working path sent by a container management program, and judging whether the path length is smaller than 255 characters;
step 3): acquiring mounting point information of a working path in an original state, and acquiring mounting point information from the mounting point information in the original state according to the working path;
step 4): sending a request for creating a namespace to a corresponding file system, wherein a specific flow is introduced in the creation of the namespace of the file system;
Step 5): creating a new mounting point information linked list;
step 6): initializing a new mounting point information linked list, and adding the mounting point information acquired in the step 3) into the linked list as mounting point information of a path;
step 7): acquiring available namespaces ID from the array;
step 8): adding a new mounting point information linked list into an array element of a system container;
step 9): the inter-process communication returns to complete the creation of the new namespace.
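The creation steps above can be sketched in miniature. This is a minimal Python illustration, not the patent's implementation: the class and field names (`MountInfo`, `MountNamespaceTable`, the 255-character limit constant) are assumptions, and a Python list stands in for the mount point information linked list.

```python
MAX_PATH = 255        # step 2: path length limit from the text
MAX_NAMESPACES = 16   # illustrative size of the system container's array

class MountInfo:
    def __init__(self, path, fs_id):
        self.path = path      # mount point path
        self.fs_id = fs_id    # ID of the backing file system container

class MountNamespaceTable:
    def __init__(self):
        # Slot 0 is the root namespace (the "original state" mount info).
        self.table = [None] * MAX_NAMESPACES
        self.table[0] = [MountInfo("/", 0)]

    def create(self, work_path):
        if len(work_path) >= MAX_PATH:            # step 2
            raise ValueError("path too long")
        # Step 3: find the mount point covering the working path in the root namespace.
        base = next((m for m in self.table[0]
                     if work_path.startswith(m.path)), None)
        if base is None:
            raise LookupError("no mount point covers this path")
        # Steps 5-6: create and initialize the new mount point list.
        new_list = [MountInfo(work_path, base.fs_id)]
        # Step 7: take the first available namespace ID.
        ns_id = next(i for i, slot in enumerate(self.table) if slot is None)
        # Step 8: install the list in the array element.
        self.table[ns_id] = new_list
        return ns_id                              # step 9: return to the caller
```

A call such as `MountNamespaceTable().create("/mnt/data")` then yields a fresh namespace ID whose mount list is independent of the root namespace's.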
Referring to FIG. 4, the steps of joining a file system mount point namespace are as follows:
step 1): initiate a join request: the container management program passes the namespace ID of the mount point system container and the system service type into the kernel;
step 2): obtain the mount point system container namespace ID and system service type passed in by the container management program;
step 3): fill the namespace ID into the mount point system container namespace slot of the application container process;
step 4): complete the join request and return from kernel mode to user mode.
Referring to FIG. 5, the steps of exiting the file system mount point namespace are as follows:
step 1): initiate an exit request: the container management program passes the system service type of the mount point system container into the kernel;
step 2): obtain the system service type passed in by the container management program, i.e. the service type of the mount point system container;
step 3): clear the namespace slot corresponding to the mount point system container in the kernel;
step 4): complete the exit request and return from kernel mode to user mode.
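The join and exit flows above both amount to the kernel writing or clearing a per-service namespace slot on the process. A minimal sketch, with service-type constants and field names assumed (they are not named in the text), where slot value 0 denotes the default root namespace:

```python
# Hypothetical system service types (one namespace slot per type).
SERVICE_MOUNT = 0   # file system mount point service
SERVICE_NET = 1     # LwIP network protocol stack service
SERVICE_PROC = 2    # process management service

class Process:
    def __init__(self):
        # 0 = default root namespace of each system container.
        self.ns_ids = {SERVICE_MOUNT: 0, SERVICE_NET: 0, SERVICE_PROC: 0}

def sys_join_namespace(proc, service_type, ns_id):
    """Kernel-side join handler: fill the slot with the passed-in ID."""
    proc.ns_ids[service_type] = ns_id

def sys_exit_namespace(proc, service_type):
    """Kernel-side exit handler: clear the slot, falling back to the root."""
    proc.ns_ids[service_type] = 0
```

Both handlers run entirely in kernel mode and return to user mode immediately, matching step 4) of each flow.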
Referring to FIG. 6, the steps for destroying the file system mount point namespaces are as follows:
step 1): initiate a destruction request: the container management program sends a request to the system container via inter-process communication to destroy a file system mount point namespace;
step 2): obtain the namespace ID sent by the container management program;
step 3): check whether the namespace ID is 0; if so, it denotes the root namespace and destruction fails;
step 4): check whether a namespace corresponding to the namespace ID exists; if it does not exist, destruction fails;
step 5): obtain the mount point information of the path;
step 6): send a namespace destruction request to the corresponding file system (the specific flow is described in the file system's namespace destruction);
step 7): clear the current mount point information linked list, which includes releasing its memory and setting the pointer to 0;
step 8): clear the corresponding namespace element in the system container's array;
step 9): the inter-process communication request returns, completing the destruction of the namespace.
Referring to FIG. 7, the steps for using the file system mount point namespaces are as follows:
step 1): the application container initiates a request, sending an inter-process communication request to the mount point system service;
step 2): obtain the namespace ID of the current application container in kernel mode; the join operation has already written the namespace ID at the corresponding position;
step 3): obtain the corresponding system container from the structure associated with the inter-process communication request;
step 4): switch to the system container process and pass the namespace ID;
step 5): obtain the namespace ID passed by the kernel;
step 6): look up the mount point linked list of the corresponding namespace in the array by namespace ID and switch to it;
step 7): execute the specific application container request;
step 8): after the request is processed, return to kernel mode;
step 9): complete the request and return to the application container.
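The request path above can be condensed into a short sketch: the kernel attaches the caller's namespace ID to the IPC, and the system container switches to that namespace's mount list before serving the request. All names here (`MountService`, `send_ipc`) are illustrative assumptions, not the patent's API:

```python
class MountService:
    """User-mode mount point system container (simplified)."""
    def __init__(self):
        self.ns_mounts = {0: ["/"]}          # namespace ID -> mount point list

    def handle(self, ns_id, path):
        mounts = self.ns_mounts[ns_id]       # step 6: switch by namespace ID
        # Step 7: serve the request against this namespace's view only.
        return any(path.startswith(m) for m in mounts)

def send_ipc(proc_ns_ids, service, service_type, path):
    # Steps 2-4: kernel reads the caller's slot and passes the ID along.
    ns_id = proc_ns_ids[service_type]
    return service.handle(ns_id, path)       # steps 5-9: container serves, returns
```

Two application containers holding different namespace IDs thus see disjoint sets of mount points from the same service process.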
2. Control group function module: referring to fig. 8, this module partitions dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources; different application containers sit in different control groups and use resources within set limits, thereby achieving limitation of system resources.
The use of control groups mainly involves resource statistics and resource limitations.
The way CPU resources are counted and limited is as follows:
The CPU resource statistic is the total number of clock interrupts received by a process, used as its CPU resource usage. Because clock interrupts are triggered at a fixed interval, when a clock interrupt fires, the kernel obtains the process currently running on the CPU that triggered the interrupt and increments that process's clock interrupt count by 1.
In order to achieve control of process CPU utilization, in addition to the statistics of CPU resources mentioned above, a modification of the scheduling policy is required.
The scheduling policy is entered after the current process's time slice reaches 0. A process is taken from the waiting queue and its CPU utilization is calculated, namely the ratio of the number of clock interrupts it has received to the total received by all processes in the waiting queue; the actual CPU utilization is then compared with the CPU utilization set by the user.
If the process's CPU utilization is too high, its time slice is reduced, possibly leaving the process waiting for a long time; if its CPU utilization is too low, its time slice is enlarged.
Through the modified scheduling policy, the time slice of each process can be controlled independently, and once every process has been scheduled for a full round, each process's CPU utilization is guaranteed to meet the value set by the user.
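The tick accounting and slice adjustment just described can be sketched as follows. This is an illustrative simplification under stated assumptions: the base slice length and the ±1 adjustment step are made-up constants, and the real scheduler would presumably scale the slice rather than step it.

```python
BASE_SLICE = 10   # illustrative initial time slice (in ticks)

class Proc:
    def __init__(self, target_share):
        self.ticks = 0                    # clock interrupts received so far
        self.target_share = target_share  # user-set CPU utilization
        self.slice = BASE_SLICE

def on_clock_interrupt(running):
    running.ticks += 1                    # statistics: +1 per clock interrupt

def adjust_slice(current, actual_share, target_share):
    if actual_share > target_share:       # over budget: shrink (may reach 0,
        return max(0, current - 1)        # i.e. the process keeps waiting)
    if actual_share < target_share:       # under budget: grow the slice
        return current + 1
    return current

def schedule(queue):
    """Pick the next process and retune its slice against the queue total."""
    total = sum(p.ticks for p in queue) or 1
    p = queue.pop(0)
    p.slice = adjust_slice(p.slice, p.ticks / total, p.target_share)
    queue.append(p)
    return p
```

Over repeated rounds, processes above their target share see their slices shrink while under-served processes see them grow, nudging actual utilization toward the user-set values.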
The manner of counting and limiting memory resources is as follows:
Page fault exceptions are captured, and the size of the allocated physical pages is added to the corresponding application container process and application container, thereby counting physical page memory usage.
If, during this process, the application container's physical memory usage is found to exceed the user-set value, the application container process exceeding the limit must be killed, or its execution suspended.
The memory control group's resource statistics are made when the kernel allocates physical memory pages: physical memory data is counted in the functions get_pages and free_pages and charged to the currently running process.
To limit an application container's use of physical memory, a check must be performed each time physical memory is counted. Memory accounting is judged per process group: after each application container process's memory data is processed, the corresponding data is updated into its process group, and the kernel checks whether the process group's memory exceeds the user's preset value.
If the actual physical memory usage of the process group exceeds the user's preset value, there are two options.
1. Directly killing the process in the process group exceeding the memory, and releasing the memory of the process;
2. the process stays in a waiting state until other processes in the process group release enough memory.
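The accounting path above reduces to a charge/uncharge pair at allocation and free time. A minimal sketch, with the names `MemCgroup`, `charge_page` and `uncharge_page` assumed (the text only names the kernel hooks get_pages and free_pages):

```python
PAGE_SIZE = 4096   # illustrative physical page size

class MemCgroup:
    """Per-process-group memory accounting (simplified)."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes   # user-preset value
        self.usage = 0

def charge_page(cgroup, npages=1):
    """Called on the get_pages path. Returns False when over the preset
    value, in which case the caller must kill or block the process."""
    cgroup.usage += npages * PAGE_SIZE
    return cgroup.usage <= cgroup.limit

def uncharge_page(cgroup, npages=1):
    """Called on the free_pages path."""
    cgroup.usage -= npages * PAGE_SIZE
```

The two policy options in the text correspond to what the caller does when `charge_page` returns False: kill a process in the group, or put the faulting process to sleep until `uncharge_page` brings usage back under the limit.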
the manner in which I/O bandwidth resources are counted and limited is as follows:
The I/O control group uses a limiter to restrict I/O bandwidth: it intercepts and parses all I/O requests sent to the device driver system container to obtain each request and its size, and counts them. The limiter's bandwidth control logic is implemented mainly with a token bucket algorithm. The principle of the token bucket algorithm is to maintain a token bucket of rated capacity that generates tokens at a constant rate; when the bucket is full, newly generated tokens cannot join it.
When a request arrives, its type and size are parsed and a token bucket is selected according to the type; if the bucket holds enough tokens, the request is issued, otherwise the request is blocked until enough tokens are available.
The token bucket algorithm can limit both the number of read/write operations and the number of read/write bytes over a period of time.
To generate tokens at a constant rate, a timer is also maintained in the limiter; it runs once per second, generates a certain amount of tokens, and checks whether any blocked I/O requests now meet the issuing requirements.
Blocking of I/O requests is accomplished with a notification mechanism. When the number of tokens is found to be insufficient, a notification capability is created and the request enters a blocked state; the I/O request and notification capability are added to a queue. The timer checks the queue for I/O requests that now meet the requirements and, if any exist, wakes the blocked I/O requests to complete the issuing process.
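The limiter described above can be sketched as a token bucket with a blocked-request queue drained from the timer path. This is a minimal illustration under assumed constants (capacity, refill rate), with the blocked queue holding plain request sizes where the real system would hold a request plus its notification capability:

```python
class TokenBucket:
    def __init__(self, capacity, rate_per_tick):
        self.capacity = capacity
        self.rate = rate_per_tick
        self.tokens = capacity
        self.blocked = []   # queued request sizes awaiting tokens

    def try_issue(self, size):
        if self.tokens >= size:
            self.tokens -= size   # subtract the designated number of tokens
            return True
        return False

    def submit(self, size):
        # Request path: issue if enough tokens, otherwise block in the queue.
        if self.try_issue(size):
            return "issued"
        self.blocked.append(size)
        return "blocked"

    def timer_tick(self):
        # Timer path: add tokens at a constant rate, capped at capacity,
        # then wake any blocked requests that can now be issued.
        self.tokens = min(self.capacity, self.tokens + self.rate)
        self.blocked = [s for s in self.blocked if not self.try_issue(s)]
```

A separate read bucket and write bucket, selected by request type, would give the per-direction IOPS and bandwidth limits the text describes.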
3. Fault recovery function module: referring to fig. 9, a method for handling a system service crash caused by a memory error is shown;
When a fault occurs it can be captured; the most common fault is a page fault exception. After capture, the faulting system container process is exited and all its resources are reclaimed.
A message requesting a restart of the system container is then sent to the process management system container, which restarts the system container immediately upon receipt and rebuilds the container's contents and its inter-process communication.
Specifically, the failure recovery function is mainly divided into two parts: fault capture and fault recovery. When the application container normally uses the system container function, the system container records the operation of the application container and stores the operation in a memory area of the kernel.
When a fault occurs and triggers a page fault, the fault is delivered to the kernel, which judges whether it occurred in a system container. If so, all inter-process communication and lock resources inside the system container are reclaimed, the system container is exited, and all inter-process communications executing in the container receive a retry return code.
Information is then sent to the process management system container, which, upon receiving it, attempts to restart the system container, rebuilds its contents according to the previously recorded operations, and re-establishes inter-process communication; once rebuilding is complete the container can be used again.
To count the resource usage of system containers more precisely and perform fault recovery on system service processes, the invention places all system service processes inside containers. In this way the resources of a system service process can be counted and limited; if a limit is exceeded, the fault recovery function is triggered and the system container is restarted and recovered.
4. The system container: the system service processes in user mode are also managed as containers, so that their CPU and memory overhead can be counted more accurately and the system service processes can be managed uniformly.
5. Direct memory access: referring to fig. 10, the I/O speed of the microkernel is accelerated by direct memory access, and the specific method of implementation is to use capability to manage the data to be transferred, and to directly copy the data from the device to the memory or directly write the data from the memory to the device by direct memory access, so as to reduce the number of inter-process communication and the number of repeated memory copies.
The invention also provides a microkernel-based container construction and operation method, wherein the microkernel-based container construction and operation system can be realized by executing the flow steps of the microkernel-based container construction and operation method, namely, a person skilled in the art can understand the microkernel-based container construction and operation method as a preferred implementation mode of the microkernel-based container construction and operation system. The method comprises the following steps:
namespaces function steps: dividing static system resources; dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources;
Control group function steps: dividing dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources, using limited resources by different control groups where different application containers are located, and realizing limitation of the system resources;
fault recovery function step: and handling the situation of system service breakdown caused by memory errors.
Next, the present invention will be described in more detail.
A container constructing and running system based on microkernel adopts microkernel architecture and divides namespaces function design, control group function design and fault recovery function design.
First, in the namespace function design, referring to fig. 2, the namespace function of this solution's microkernel container is implemented mainly at the EL0 layer, with part of the implementation code also placed at the EL1 layer.
For a newly started application container, the capability of the corresponding namespace is added to the application container with an initial value of 0, where 0 indicates that the application container uses the default-initialized system resources in the corresponding system container.
When a user sends an inter-process communication to obtain a system container's service, the namespace capability of the corresponding system container is obtained from the kernel, and the system container carrying this namespace information selects the appropriate system resources to serve user mode.
Different system resource partitioning modes need to be designed for different system containers in the microkernel.
For the file system mount point system container, the container's function is to select different mount points according to the path provided by the user process and return the corresponding IDs and inter-process communication capabilities; the application container then sends specific file operations to the corresponding file system container according to those IDs and capabilities. The container also provides mount and umount operations.
The system container mainly stores the system's mount point information, so it is this information that must be isolated. A linked list is used inside the container to maintain a group of mount point entries. So that each new file system mount point namespace has its own mount point information, every time such a namespace is created a new mount point information linked list must be created to store that namespace's mount point information; an application container then selects the corresponding linked list according to its file system mount point namespace.
For the LwIP network protocol stack system container, the function of the system container is to provide network support for upper layer applications, and to execute open, close, read, write network protocol stack operations according to the request of the user.
In the system container, the main stored information is a network interface list and network data. The network interface list is a series of network interfaces organized by a linked list; the network data is a continuous array of memory for storing data received and transmitted in the network interface.
In the LwIP network protocol stack system container, each time a new network protocol stack namespace is created, an empty linked list is created to store the namespace's network interfaces, a loopback network interface is initialized for it, and a new contiguous memory array is created to handle the namespace's network data.
For a process management system container, the function of the system container is to provide process numbers and process tree support, and the system container is responsible for creating new processes, recovering zombie processes and other operations.
The information mainly stored in the system container is the first process created at system start-up, i.e. the root node of the process tree.
In the process management system container, each time a new process management namespace is created, a node in the process tree must be selected as the new process tree root; any process created in that namespace descends from this root according to the tree structure and cannot be observed by other process management namespaces.
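The subtree visibility rule can be illustrated with a small sketch: a namespace pins a root node, and lookups from that namespace see only the root's descendants. The `ProcNode`/`visible_pids` names are assumptions for illustration:

```python
class ProcNode:
    def __init__(self, pid, parent=None):
        self.pid = pid
        self.children = []
        if parent:
            parent.children.append(self)   # new processes hang below their parent

def visible_pids(ns_root):
    """All processes a namespace can observe: its pinned root's subtree."""
    out, stack = [], [ns_root]
    while stack:
        n = stack.pop()
        out.append(n.pid)
        stack.extend(n.children)
    return sorted(out)
```

A namespace rooted at an inner node thus cannot observe its siblings or ancestors, while the root namespace (rooted at the first process created at boot) sees the whole tree.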
The control group functional design mainly comprises a CPU, a memory and an I/O bandwidth.
The CPU control group is implemented at the EL1 layer. For microkernel application containers, to meet real-time requirements while adding the CPU control group function, a priority-based time slice round-robin scheduling algorithm is adopted to support application container real-time performance.
Real-time behavior is achieved by assigning priorities to processes in the priority-based time slice round-robin scheduling algorithm: different processes have different priorities, and the process to schedule is selected according to priority.
If the priority of the process is higher, the process with higher priority is scheduled preferentially in scheduling.
If the priority of the processes are consistent, then time-slice round-robin scheduling is employed to schedule the processes with consistent priority.
The CPU control group cannot be applied across processes of different priorities, because preemption would prevent accurate accounting of CPU utilization; for processes of the same priority, however, the CPU control group can be used, counting each process's CPU utilization during scheduling and controlling it accordingly.
Referring to fig. 11, the CPU resource statistic is the number of clock interrupts received by a process, used as its CPU resource usage. Because clock interrupts are triggered at a fixed interval, when a clock interrupt fires, the kernel obtains the process currently running on the CPU that triggered the interrupt and increments that process's clock interrupt count by 1.
Because inter-process communication exists under the microkernel architecture, the case where IPC changes the executing process must be considered. IPC switches execution rights between processes: when an application container process requests service from a system container through IPC, the running process is no longer the original one but the system container process, yet that running time still belongs to the application container process.
To solve this problem, we focus on the invariant in IPC initiated by an application container process, namely the execution body (scheduling context). IPC passes the application container process's scheduling context to the system container process, so the variables counting CPU resources are stored in the scheduling context. Even if execution switches from the application container process to the system container process through IPC, the application container process's execution time is still counted correctly.
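The scheduling-context invariant can be sketched as follows: tick accounting targets the scheduling context rather than the thread, and an IPC call lends the caller's context to the callee. All names (`SchedContext`, `Thread`, `ipc_call`) are illustrative assumptions:

```python
class SchedContext:
    """Execution body: holds the CPU accounting variables."""
    def __init__(self):
        self.ticks = 0

class Thread:
    def __init__(self, sc):
        self.sc = sc   # current scheduling context (the IPC invariant)

def on_tick(running):
    running.sc.ticks += 1   # charge the context, not the thread identity

def ipc_call(caller, callee):
    # The callee borrows the caller's scheduling context for the call's duration.
    saved, callee.sc = callee.sc, caller.sc
    return saved

def ipc_return(callee, saved):
    callee.sc = saved       # restore the callee's own context on return
```

Ticks that land while the system container thread runs on a borrowed context are thus billed to the application container that initiated the IPC.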
To control a process's CPU utilization, besides the CPU resource statistics above, the scheduling strategy must be modified so that the process's CPU utilization is regulated by adjusting its time slice (Time Budget).
The scheduling policy is entered after the current process's time slice reaches 0. A process is taken from the waiting queue and its CPU utilization is calculated, namely the ratio of the number of clock interrupts it has received to the total received by all processes in the waiting queue; the actual CPU utilization is then compared with the CPU utilization set by the user.
If the process's CPU utilization is too high, its time slice is reduced, possibly leaving the process waiting for a long time; if its CPU utilization is too low, its time slice is enlarged.
Through this scheduling strategy, the time slice of each process can be controlled independently, and once every process has been scheduled for a full round, each process's CPU utilization conforms to the user-set value.
The memory control group is likewise implemented at the EL1 layer. It limits the application container's physical memory usage, covering all anonymous memory pages for which page table mappings have been established; for shared memory, that portion of physical memory usage is charged to the process that first uses the memory pages.
Referring to fig. 12, a process flow of the memory control group is shown.
Similarly, inter-process communication must be considered, so the corresponding data is likewise counted in the process's scheduling context; after IPC occurs, used physical memory pages are still correctly charged to the application container.
I/O bandwidth resources are counted above the driver layer. Under the microkernel architecture, the call flow of block device I/O is application container -> file system container -> driver system container. To count all block device I/O requests, a new system container, called the limiter, is added between the file system container and the driver system container to control I/O bandwidth. All data flows in and out through the limiter.
Referring to FIG. 13, a process flow of the I/O control group is shown.
The limiter's I/O bandwidth control logic is implemented mainly with a token bucket algorithm. The principle is to maintain a token bucket that has an upper capacity limit and continuously generates tokens at a constant rate; when the bucket is full, newly generated tokens cannot join it.
When an I/O request arrives, its type and size are parsed and the read or write token bucket is selected according to the type; if the bucket holds enough tokens, the specified number of tokens is subtracted and the request is issued; otherwise the request is blocked until the bucket holds enough tokens.
The token bucket algorithm can limit both the I/O read/write operation rate and the I/O read/write bandwidth.
To continue generating tokens at a constant rate, a timer is maintained in the limiter; it runs once per second, generates a certain amount of tokens, and checks whether any blocked I/O requests now meet the issuing requirements.
Blocking and waking of I/O requests uses a notification mechanism. When the number of tokens is found to be insufficient, a notification capability is created and the request enters a blocked state; the I/O request and notification capability are added to a queue. The timer checks the queue for I/O requests that meet the requirements and, if any exist, wakes the blocked I/O requests to complete the I/O request issuing process.
For strong system reliability, a fault recovery function is designed for errors inside system containers, so that when an internal error crashes a system container, the crash can be captured, the system container restarted, its key data recovered, and the application container's request re-executed.
Referring to fig. 9, a process flow of fault recovery is shown.
When an error in a system container process causes a page fault exception, the system captures the error at the page fault interrupt and analyzes whether it came from a system container process; if so, the container's internal logic or memory has gone wrong and the container must be restarted.
First, because of the page fault exception, all inter-process communication data managed by the current system container process can be obtained through its process data; all IPC locks are released, the return value for the application container process is set, and the IPC connection is marked unusable. Finally the system container process is exited. On exit, a message is sent to the process management system container, which, upon receipt, attempts to restart the system container and restore its internal data.
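The recovery path just described can be sketched end to end: mark in-flight IPC retryable, exit the faulting container, and have the process manager restart it and rebuild state from the recorded operations. All class and field names here are assumptions, and a list of strings stands in for the operation log kept in kernel memory:

```python
RETRY = -1   # illustrative "please retry" IPC return code

class SystemContainer:
    def __init__(self, name):
        self.name = name
        self.op_log = []        # operations recorded during normal use
        self.connections = []   # in-flight IPC connections
        self.alive = True

    def record(self, op):
        self.op_log.append(op)  # written while serving application containers

def on_page_fault(container, process_manager):
    # Fault capture: set retry return codes, exit the faulting container,
    # then notify the process manager.
    for conn in container.connections:
        conn["ret"] = RETRY
    container.alive = False
    process_manager.restart(container)

class ProcessManager:
    def restart(self, container):
        replayed = list(container.op_log)   # rebuild state from the log
        container.connections = []          # IPC must be re-established
        container.alive = True
        return replayed
```

Because the operation log survives the crash (it lives outside the faulting process in this sketch), the rebuilt container converges to the state it had before the fault, and blocked callers simply retry.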
In the prior art, user-mode application programs are wrapped in containers; in the microkernel scenario, however, kernel-mode system services have been moved up to user mode, so system service processes can also be placed in containers and all system resources handed to containers for management. In this way, system service process resources can be counted at a finer granularity and, combined with the fault recovery capability mentioned above, when a system container errs it can be captured and restarted without affecting normal system operation.
The management scope of the system container is also different from that of the application container, and the management scope of the application container is a name space and a control group; while the management scope of the system container is control group and failback.
An application container must manage the system resources it can access and limit the total amount of resources available to it, so a namespace is used to manage the accessible system resources and a control group is used to limit the total amount.
Faults in an application container are caused by user-written code, so the system is not responsible for recovering from user errors; faults in a system container, however, are the system's responsibility. To ensure a system container's CPU and memory do not exceed the specified limits, its CPU and memory are counted; if CPU or memory usage exceeds the rated limit, fault recovery is triggered and the system container is restarted and recovered so that it can continue serving application containers normally.
The present invention innovatively proposes using direct memory access to speed up microkernel I/O. For common microkernels, because different system containers are in different address spaces and the amount of memory usable as shared memory between address spaces for inter-process communication is limited, a request must be divided into multiple requests and completed through continuous memory copying and a large amount of inter-process communication.
Consider a case where a user wants to read a 1MB file while the shared memory is only 4KB: the request must be split into many requests, and in the path from file system to driver, because other metadata must also be carried, the memory actually available for file content shrinks further. The number of inter-process communications from user to file system then reaches the thousands, and from file system to driver roughly twice that; the situation worsens as the inter-process communication chain lengthens, so the number of inter-process communications grows multiplicatively across the whole request.
To solve the two problems described above, the present invention proposes a solution to accelerate the I/O of microkernels using direct memory access. Referring to FIG. 10, a flow of direct memory access operation I/O is shown.
When the application container sends a request, if the request is a read request, the application container applies for a memory space with a specified size in advance, the memory space with the specified size is transferred to the file system container and the device driving system container as a capability, after the device driving system container takes the capability, whether a physical address corresponding to the capability is mapped or not is checked, if no memory is mapped, the physical memory is mapped, then a direct memory access operation can be directly performed on the physical memory contained in the capability, data can be directly written into the corresponding physical memory space from the device, and because the portion is asynchronous, inter-process communication of the application container can be directly returned, and no data return is carried, and no memory copy and redundant inter-process communication operation are carried.
If the request is a write request, the application container takes the physical memory space corresponding to the data to be written as a capability, sends the capability and other request data to the file system container and the device driving system container together, and after the device driving system container takes the capability, the data can be directly written from the memory to the designated position of the driver in a direct memory access mode, and after the request is issued, the request can be returned.
After the read or write request is completed, an interrupt is triggered to tell the system that the current request has been completed.
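The capability-based DMA read path above can be sketched as follows. This is a toy model, not a real microkernel API; all class and function names are hypothetical:

```python
# Toy model of the DMA read path: the application passes a memory
# capability down the chain; the driver maps it on first use and the
# device writes into it directly, so the IPC reply carries no payload.
# All structures here are illustrative assumptions.

class MemCapability:
    def __init__(self, size):
        self.size = size
        self.phys = None          # physical pages, mapped lazily

    def ensure_mapped(self):
        if self.phys is None:     # map physical memory on first use
            self.phys = bytearray(self.size)
        return self.phys

def driver_handle_read(cap, device_data):
    buf = cap.ensure_mapped()             # check/establish the mapping
    n = min(len(device_data), cap.size)
    buf[:n] = device_data[:n]             # device DMAs straight into buf
    return n                              # reply carries only a count

cap = MemCapability(4096)                 # app pre-allocates the buffer
written = driver_handle_read(cap, b"file contents from disk")
print(written, bytes(cap.phys[:written]))
```

Because the device writes into memory the application already owns, no copy back across the IPC chain is needed, which is the saving the patent describes.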
The embodiments of the invention provide a microkernel-based container construction and operation system and method. The system provides flexible container support in microkernel environments and can be flexibly ported to each microkernel platform without requiring the environmental support of a specific microkernel. Through flexible interrupt isolation and independence from the scheduling algorithm, it can further support real-time requirements within a specific container; it also makes resource statistics more accurate, independently manages the behavior of system services, and realizes stronger resource isolation. Compared with the prior art, the invention achieves a performance improvement while obtaining stronger isolation and security.
Those skilled in the art will appreciate that, in addition to implementing the system and its individual devices, modules, and units provided by the application as pure computer-readable program code, the same functions can be realized entirely by logic programming of the method steps, in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system and its devices, modules, and units can be regarded as a hardware component, and the devices, modules, and units that realize the various functions included in the system can also be regarded as structures within the hardware component; they may likewise be regarded either as software modules implementing the method or as structures within the hardware component.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A microkernel-based container build and run system, comprising: a namespace function module, a control group function module, and a fault recovery function module;
wherein the namespace function module is used for partitioning static system resources: it divides mounting point data, network protocol stack data, and process management data, so that different application containers, located in different namespaces, access different system resources, realizing isolation of the system resources;
the control group function module is used for partitioning dynamic system resources: it counts and limits CPU resources, memory resources, and I/O bandwidth resources, so that different application containers, located in different control groups, use limited resources, realizing limitation of the system resources;
the fault recovery function module is used for handling system service crashes caused by memory errors;
the system also includes a system container to enhance resource statistics and failure recovery for the system service processes.
2. The microkernel-based container build and run system of claim 1, wherein the namespace function module supports creating, joining, exiting, destroying, and using a file system mounting point namespace;
the file system mounting point namespace is created as follows:
step 1): initiating a creation request, sending a request to a system container by a container management program in an inter-process communication mode, and creating a naming space of a file system mounting point;
step 2): acquiring a working path sent by a container management program, and judging whether the path length is smaller than 255 characters;
step 3): acquiring the mounting point information of the original state, and obtaining from it the mounting point information corresponding to the working path;
step 4): sending a request for creating a namespace to a corresponding file system, wherein a specific flow is introduced in the creation of the namespace of the file system;
step 5): creating a new mounting point information linked list;
step 6): initializing a new mounting point information linked list, and adding the mounting point information acquired in the step 3) into the linked list as mounting point information of a path;
step 7): acquiring available namespaces ID from the array;
step 8): adding a new mounting point information linked list into an array element of a system container;
step 9): the inter-process communication returns to complete the creation of the new namespace.
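Steps 1) to 9) of namespace creation can be condensed into a short sketch. The data layout (a dict for mount info, a fixed-size array of namespaces) and all names are illustrative assumptions, not the patent's actual structures:

```python
# Sketch of mount-point namespace creation: copy the mount info of the
# working path from the original state into a fresh list, then register
# it in the system container's namespace array. Names are illustrative.

MAX_PATH = 255
root_mounts = {"/": "rootfs", "/data": "ext4"}   # original-state mount info
namespaces = [None] * 16                         # system container's array
namespaces[0] = list(root_mounts.items())        # ID 0 = root namespace

def create_mount_ns(work_path):
    if len(work_path) >= MAX_PATH:               # step 2): path length check
        raise ValueError("path too long")
    info = root_mounts[work_path]                # step 3): look up mount info
    new_list = [(work_path, info)]               # steps 5)-6): new mount list
    ns_id = namespaces.index(None)               # step 7): first free ID
    namespaces[ns_id] = new_list                 # step 8): register in array
    return ns_id                                 # step 9): reply over IPC

ns = create_mount_ns("/data")
print(ns, namespaces[ns])
```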
3. The microkernel-based container build and run system of claim 2 wherein the step of joining a file system mount point namespace is as follows:
step 1): initiating a request for joining a namespace, wherein the container management program needs to transmit the namespace ID of the mounting point system container and the system service type into the kernel;
step 2): acquiring a system container name space ID and a system service type of a mounting point system, which are transmitted by a container management program;
step 3): filling the name space ID of the mount point system container, which is transmitted by the container management program, into the position of the name space of the mount point system container of the application container process;
step 4): and finishing the namespace joining request, and returning to the user mode from the kernel mode.
4. The microkernel-based container build and run system of claim 1 wherein the step of exiting the file system mount point namespace is as follows:
step 1): initiating a request for exiting a namespace, wherein a container management program needs to transmit the system service type of the system container of the mounting point into a kernel;
step 2): acquiring a system service type transmitted by a container management program, namely a system service type of a system container of a mounting point system;
step 3): clearing up the namespaces corresponding to the system containers of the mounting points in the kernel;
step 4): and finishing the namespace exit request, and returning to the user mode from the kernel mode.
5. The microkernel-based container build and run system as in claim 1 wherein the step of destroying the file system mount point namespaces is as follows:
step 1): the method comprises the steps that a destruction request is initiated, a container management program sends a request to a system container in an inter-process communication mode, and a naming space of a file system mounting point is destroyed;
step 2): acquiring a name space ID sent by a container management program;
step 3): judging whether the name space ID is 0, if so, representing a root name space, and failing to destroy;
step 4): judging whether a namespace corresponding to the namespace ID exists; if the namespace does not exist, the destruction fails;
step 5): acquiring mounting point information of a path;
step 6): sending a request for destroying the name space to a corresponding file system;
step 7): clearing the current mounting point information linked list, including releasing the memory and setting the pointer to 0;
step 8): clearing corresponding name space elements in the array of the system container;
step 9): and returning the inter-process communication request to complete the destruction of the naming space.
6. The microkernel-based container build and run system as in claim 1 wherein the step of using the file system mount point namespace is as follows:
step 1): the application container initiates a request and sends an inter-process communication request to the mounting point system service;
step 2): acquiring a name space ID of a current application container in a kernel mode;
step 3): acquiring a corresponding system container from a structure body related to the inter-process communication request;
step 4): switching the process to a system container process and transmitting a namespace ID;
step 5): acquiring a name space ID transferred by a kernel;
step 6): searching the mounting point linked list information of the corresponding name space from the array according to the name space ID and switching;
step 7): executing a specific application container request;
step 8): after the request is processed, returning to a kernel mode;
step 9): and finishing the request and returning to the application container.
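The use of a namespace in steps 1) to 9) amounts to selecting the mount list that matches the caller's namespace ID before handling the request. A minimal sketch, with an assumed dict-based layout and hypothetical names:

```python
# Sketch of serving a request inside a namespace: the kernel passes the
# caller's namespace ID to the system container, which switches to the
# matching mount list before executing the request. Illustrative only.

namespaces = {0: [("/", "rootfs")], 3: [("/app", "tmpfs")]}

def handle_request(caller_ns_id, path):
    mounts = namespaces[caller_ns_id]     # step 6): switch by namespace ID
    for prefix, fs in mounts:             # resolve path in that namespace
        if path.startswith(prefix):
            return fs
    raise FileNotFoundError(path)

# The same path can resolve differently in different namespaces.
print(handle_request(3, "/app/log.txt"))
print(handle_request(0, "/app/log.txt"))
```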
7. The microkernel-based container build and run system of claim 1, wherein the control group function module comprises:
statistics and limitation of CPU resources: the total number of clock interrupts received by a process is taken as its CPU resource usage; when a clock interrupt is triggered, the kernel acquires the process running on the CPU that triggered the interrupt and adds 1 to the number of clock interrupts received by that process;
the scheduling strategy is modified: when the time slice of the current process reaches 0, the scheduler takes a process out of the waiting queue, calculates its CPU utilization, namely the ratio of the number of clock interrupts received by the process to the sum of the clock interrupts received by all processes in the waiting queue, and compares the actual CPU utilization with the CPU utilization set by the user;
if the CPU utilization of the process is higher, its time slice is reduced, possibly even keeping the process in a waiting state for a long time; if the CPU utilization of the process is lower, its time slice is enlarged;
the time slice of each process is controlled independently through the modified scheduling strategy, ensuring that the CPU utilization of each process conforms to the value set by the user over one complete scheduling round;
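The tick-based accounting and slice adjustment above can be sketched briefly. The base slice length and the halve/double policy are assumed constants for illustration, not values from the claims:

```python
# Sketch of tick-based CPU accounting: each clock interrupt credits the
# running process, and the scheduler shrinks or grows a process's time
# slice so its observed tick share tracks the user-set quota.
# The constants and the halve/double policy are illustrative.

BASE_SLICE = 10

def next_slice(ticks, all_ticks, quota, base=BASE_SLICE):
    usage = ticks / sum(all_ticks)        # observed CPU share
    if usage > quota:
        return max(1, base // 2)          # over quota: shrink the slice
    return base * 2                       # under quota: grow the slice

# Process A has received 80 of 100 ticks but is entitled to only 50%.
print(next_slice(80, [80, 20], 0.5))      # slice shrinks
print(next_slice(20, [80, 20], 0.5))      # slice grows
```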
statistics and limitation of memory resources: page fault exceptions are captured; when a physical page is allocated, its size is added to the corresponding application container process and application container, realizing statistics of physical-page memory usage;
if the physical memory usage of the application container exceeds the value set by the user, the application container process exceeding the memory usage is killed, or its operation is paused;
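The page-fault-driven accounting can be sketched as follows; the page size, class names, and the kill-on-excess policy shown are illustrative assumptions:

```python
# Sketch of page-fault-driven memory accounting: every handled fault
# charges the allocated page size to the container's counter, and the
# limit check kills the offender. Names and policy are illustrative.

PAGE = 4096

class Container:
    def __init__(self, limit):
        self.limit, self.used, self.state = limit, 0, "running"

    def on_page_fault(self):
        self.used += PAGE                     # charge the new physical page
        if self.used > self.limit:            # enforce the user-set cap
            self.state = "killed"
        return self.state

c = Container(limit=2 * PAGE)
c.on_page_fault()
c.on_page_fault()
print(c.on_page_fault())                      # third page exceeds the cap
```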
statistics and limitation of I/O bandwidth resources: a rate limiter system service is added between the file system system service and the device driver system service; each I/O request is captured by the rate limiter system service, which judges, according to the type and size of the I/O request, whether the requirements for issuing the request are met; if not, issuing of the request is suspended; if so, the request is issued and the token count in the rate limiter system service is updated;
the number of tokens increases gradually as the system operates, up to an upper limit; once the upper limit is reached, the number of tokens cannot continue to increase.
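The mechanism described is a token bucket. A minimal sketch, with an assumed refill rate and bucket size:

```python
# Sketch of the token-bucket rate limiter interposed between the file
# system and the driver: a request spends tokens sized to its I/O cost,
# and tokens refill over time up to a cap. Parameters are illustrative.

class RateLimiter:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst   # tokens per tick, bucket cap
        self.tokens = burst

    def tick(self):                           # called as the system runs
        self.tokens = min(self.burst, self.tokens + self.rate)

    def admit(self, cost):
        if self.tokens >= cost:               # enough tokens: issue request
            self.tokens -= cost
            return True
        return False                          # otherwise: hold the request

rl = RateLimiter(rate=2, burst=8)
print(rl.admit(8))    # drains the bucket
print(rl.admit(1))    # held: no tokens left
rl.tick()
rl.tick()
print(rl.admit(4))    # refilled to 4 tokens, so the request is issued
```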
8. The microkernel-based container build and run system of claim 1, wherein the fault recovery function module is configured such that: when a fault occurs, the fault is captured, the most common fault being a page fault exception; after capture, an exit operation is performed on the faulty system container process and all of its resources are reclaimed;
a message to restart the system container is then sent to the process management system container, which restarts the system container immediately after receiving the message and rebuilds the contents of the system container and its inter-process communication.
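The capture-reclaim-restart sequence can be sketched as three steps; all function names here are hypothetical stand-ins for the real fault handler, resource reclamation, and process-management messages:

```python
# Sketch of the crash-recovery flow: a fault in a system container is
# caught, the faulting process exits and its resources are reclaimed,
# then the process-management container restarts a fresh instance.
# All names are hypothetical.

log = []

def reclaim(proc):
    log.append(f"reclaimed {proc}")           # free all of its resources

def process_manager_restart(name):
    log.append(f"restarted {name}")           # rebuild contents and IPC
    return {"name": name, "state": "running"}

def on_fault(proc):
    log.append(f"page fault in {proc}")       # most common fault type
    reclaim(proc)                             # force-exit the process
    return process_manager_restart(proc)      # message the manager

svc = on_fault("fs-container")
print(svc["state"], log)
```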
9. The microkernel-based container building and running system according to claim 1, wherein the system container is located in user mode and is itself managed as a container; the CPU and memory overhead of the system service processes are counted, and the system service processes are managed uniformly;
the system accelerates microkernel I/O through direct memory access, managing the data to be transferred with capabilities; data is copied directly from the device to memory, or written directly from memory to the device, by direct memory access, reducing the number of inter-process communications and of repeated memory copies.
10. A microkernel-based container construction and operation method, comprising:
namespaces function steps: dividing static system resources; dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources;
control group function steps: dividing dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources, using limited resources by different control groups where different application containers are located, and realizing limitation of the system resources;
fault recovery function step: handling system service crashes caused by memory errors.
CN202310746321.5A 2023-06-21 2023-06-21 Container construction and operation system and method based on microkernel Pending CN116700901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310746321.5A CN116700901A (en) 2023-06-21 2023-06-21 Container construction and operation system and method based on microkernel

Publications (1)

Publication Number Publication Date
CN116700901A true CN116700901A (en) 2023-09-05

Family

ID=87840908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310746321.5A Pending CN116700901A (en) 2023-06-21 2023-06-21 Container construction and operation system and method based on microkernel

Country Status (1)

Country Link
CN (1) CN116700901A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116909753A (en) * 2023-09-12 2023-10-20 中国电子科技集团公司第十五研究所 Method and system for limiting kernel state operating system resources based on process grouping


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination