CN113467981A

CN113467981A - Exception handling method and device

Info

Publication number: CN113467981A
Application number: CN202010246296.0A
Authority: CN
Inventors: 祁德春; 张亮; 鲁志军; 杨杰
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2021-10-01

Abstract

The application provides an exception handling method and device that can perform hierarchical recovery when an illegal memory access occurs in the kernel. In the method, when an access exception occurs in system memory, the memory address at which the exception occurred is determined, and before the process that caused the exception is killed, the locks it holds for protecting public resources of the kernel are released. The exit of the abnormal process therefore does not leave public resources unreleased, and other processes can continue to use them. This realizes fault-tolerant processing in the kernel, avoids the whole-machine reset and reduced system availability that a recoverable memory access exception would otherwise cause, and improves system reliability.

Description

Exception handling method and device

Technical Field

The present application relates to the field of computers, and more particularly, to a method and apparatus for exception handling.

Background

In a Linux system, processes are the carriers on which the system runs. In kernel space, the kernel itself is a privileged process that includes system-level threads and maintains the operation of the entire system kernel. In user space, multiple user processes implement different functions; some run independently and some depend on one another. The kernel process has exclusive use of the kernel space, and a user-space process cannot access the kernel space directly but must enter the kernel through a system interface to read or write.

At present, when an illegal address access occurs in Linux kernel space, one of the following two schemes can be adopted. In the first scheme, the kernel variable panic_on_oops is set to 1, so that when the kernel handles an illegal memory operation, a panic is triggered directly and the system is reset. In the second scheme, panic_on_oops is set to 0, so that the kernel continues to run when it encounters an illegal memory access, and only the process that triggered the abnormal access is killed. However, in the first scheme, for a terminal device, a system reset caused by a panic degrades the user experience and can cause negative public opinion; for system equipment, a system reset caused by a panic shortens the available time of the system and reduces its reliability. In the second scheme, when the process that made the abnormal access holds a lock protecting a common resource in kernel space, the direct exit of that process may leave other processes unable to continue using the common resource, resulting in abnormal system functions.

Therefore, when an illegal memory access occurs in the kernel, how to perform exception handling is an urgent problem to be solved.

Disclosure of Invention

The application provides a method and a device for exception handling, which can carry out hierarchical recovery processing when illegal memory access occurs in a kernel.

In a first aspect, a method for exception handling is provided, where the method includes:

when the system memory is abnormally accessed, determining the memory address of the abnormally accessed memory;

and under the condition that the memory address with the access exception is in the user space or the memory address with the access exception is in the kernel space and the access operation is read, releasing at least one lock which is held by the process with the access exception and used for protecting the public resource of the kernel, and ending the process with the access exception.

Therefore, in the embodiment of the application, when a memory access exception occurs in the user space, or occurs in the kernel space and the access is a read operation, the lock that the abnormal process holds for protecting the public resource of the kernel is released before the process is killed. The exit of the abnormal process then does not leave the public resource unreleased, and other processes can continue to use the public resource of the kernel. This realizes fault-tolerant processing in the kernel, helps avoid the whole-machine reset and reduced system availability caused by a recoverable memory access exception, and improves system reliability.

As an example, when an access exception occurs in system memory, the memory address where the access exception occurs may be determined through a memory access exception interface function, for example static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs).

The second parameter "addr" of the interface function indicates the address where the exception occurred. When the value of "addr" is smaller than TASK_SIZE, the memory address where the access exception occurs is located in the user space. Conversely, when the value of "addr" is greater than or equal to TASK_SIZE, the memory address where the access exception occurs is located in the kernel space.
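By way of illustration only, the address comparison described above can be sketched in user-space C as follows. This is not kernel code: EXAMPLE_TASK_SIZE is a hypothetical stand-in for the architecture-defined TASK_SIZE, and the enum and function names are invented for this sketch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's TASK_SIZE; the real value
 * depends on the architecture and kernel configuration. */
#define EXAMPLE_TASK_SIZE (UINT64_C(1) << 39)

enum fault_space { FAULT_IN_USER_SPACE, FAULT_IN_KERNEL_SPACE };

/* Addresses below TASK_SIZE are treated as user space; addresses at or
 * above it are treated as kernel space. */
enum fault_space classify_fault_addr(uint64_t addr)
{
    return addr < EXAMPLE_TASK_SIZE ? FAULT_IN_USER_SPACE
                                    : FAULT_IN_KERNEL_SPACE;
}
```

The boundary value itself is classified as kernel space, which follows the "greater than or equal to" wording above.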

As an example, the operation type of the abnormal access may also be determined through the memory access exception interface function. The sixth bit of the third parameter esr of the interface function, for example static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs), indicates the operation type of the access exception. When the value of the sixth bit of esr is "0", the abnormal access is a read operation; when the value of the sixth bit of esr is "1", the abnormal access is a write operation.
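As a minimal sketch of the read/write test, the following C fragment decodes the operation type from esr. It assumes that "the sixth bit" means bit index 6 counting from zero, which is where the arm64 syndrome register places its write-not-read flag for data aborts; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "sixth bit" is read as bit index 6, counting from 0. */
#define EXAMPLE_ESR_WRITE_BIT (UINT32_C(1) << 6)

/* Returns true when the faulting access was a write, false for a read. */
bool fault_is_write(uint32_t esr)
{
    return (esr & EXAMPLE_ESR_WRITE_BIT) != 0;
}
```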

As a possible implementation, ending the process with the access exception may be implemented by setting the kernel variable panic_on_oops to 0, for example by setting panic_on_oops to 0 when the kernel image is built, or by setting the value of /proc/sys/kernel/panic_on_oops to 0.

With reference to the first aspect, in certain implementations of the first aspect, the releasing the lock that the access-exception process holds for protecting at least one common resource of the kernel includes:

and releasing the at least one lock for protecting the public resource of the kernel, which is recorded on the lock holding linked list corresponding to the process with the abnormal access.

Therefore, the lock for protecting the public resource of the kernel, which is held by the process, can be managed through the lock holding linked list corresponding to the process, and the release of the lock of at least one public resource containing the kernel, which is held by the process, is realized by releasing the lock recorded on the lock holding linked list.

With reference to the first aspect, in some implementations of the first aspect, the task structure (task_struct) of the access-abnormal process includes the lock holding linked list, where the task structure is used to manage the access-abnormal process.

Therefore, because the lock holding linked list is contained in the task structure that manages the process, the locks on kernel public resources accessed by the process are managed through the task structure, which keeps the scheme compatible with the existing system architecture.

With reference to the first aspect, in some implementations of the first aspect, before the releasing the lock, recorded on the lock holding linked list corresponding to the access-abnormal process, of the at least one common resource used for protecting the kernel, the method further includes:

when the process enters a kernel and the process holds a first lock, determining whether a resource protected by the first lock is a public resource of the kernel;

and adding the first lock to the tail part of the lock holding linked list under the condition that the resource protected by the first lock is the common resource of the kernel.

With reference to the first aspect, in some implementations of the first aspect, a data structure corresponding to the first lock includes a first member, and the first member is used to indicate a memory address of a resource protected by the first lock.

As an example, a new member "protected resource" (an example of the first member, denoted shared_resource) may be added to the data structure of a lock held by a process, where shared_resource is the memory address of the resource protected by the lock.

With reference to the first aspect, in some implementations of the first aspect, determining whether the resource protected by the first lock is a common resource of a kernel includes:

and if the memory address of the resource protected by the first lock is located in the uninitialized data segment or the initialized data segment, determining that the resource protected by the lock is the common resource of the kernel.

When the resource address represented by shared_resource is located in [_bss_start, _bss_end] or in [_sdata, _edata], it is determined that the resource protected by the lock corresponding to shared_resource is a common resource of the kernel, and the lock may then be added to the tail of the lock holding linked list corresponding to the task_struct of the process. When the resource address represented by shared_resource is located neither in [_bss_start, _bss_end] nor in [_sdata, _edata], it is determined that the resource protected by the lock corresponding to shared_resource is not a common resource of the kernel, and the lock is not added to the lock holding linked list corresponding to the task_struct of the process.

Therefore, in the embodiment of the application, when a process enters the kernel, each lock that the process holds for protecting a public resource of the kernel is added to the lock holding linked list, so that the list records the locks the process holds for protecting public resources of the kernel.

With reference to the first aspect, in certain implementations of the first aspect, the method further includes:

and after releasing at least one lock for protecting the public resource of the kernel, which is held by the abnormal access process, or finishing the abnormal access process, restarting the system if the kernel of the system cannot run.

As a possible implementation, triggering a panic to reset the system can be implemented by setting panic_on_oops to 1, for example by setting panic_on_oops to 1 when building the kernel image, or by setting the value of /proc/sys/kernel/panic_on_oops to 1.

Therefore, according to the embodiment of the application, after at least one lock that the process with the access abnormality holds for protecting the public resource of the kernel has been released, or after that process has been ended, if the kernel of the system cannot run, the kernel no longer performs fault tolerance but resets and restarts, so that memory access exceptions in the kernel are handled by hierarchical recovery.

and restarting the system under the condition that the memory address with the abnormal access is in the kernel space and the access operation is write operation.

Therefore, the embodiment of the application can restart the system under the condition that the address space where the exception occurs is the kernel space and the exception access is the write operation, so that the memory access exception in the kernel is subjected to the hierarchical recovery processing.
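Taken together with the recoverable cases of the first aspect, the choice between fault tolerance and restart can be sketched as a small decision function. This is an illustrative reading of the text only; the enum and function names are hypothetical.

```c
#include <stdbool.h>

enum recovery_action {
    RELEASE_LOCKS_AND_END_PROCESS,  /* recoverable: kernel keeps running */
    RESTART_SYSTEM                  /* not recoverable: reset and restart */
};

/* A write to a kernel-space address leads to a restart. A fault on a
 * user-space address, or a read of a kernel-space address, is treated as
 * recoverable. */
enum recovery_action choose_recovery(bool addr_in_kernel_space,
                                     bool access_is_write)
{
    if (addr_in_kernel_space && access_is_write)
        return RESTART_SYSTEM;
    return RELEASE_LOCKS_AND_END_PROCESS;
}
```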

and releasing at least one lock which is used for protecting the public resource of the kernel and held by the abnormal process, and restarting the system under the condition that the frequency of finishing the abnormal process is greater than a preset value.

In this way, in the embodiment of the application, at least one lock for protecting the public resource of the kernel, which is held by the abnormal process, can be released within a preset time period, and the system is restarted under the condition that the number of times of ending the abnormal process is greater than a preset value, so that the hierarchical recovery processing of the memory access abnormality in the kernel is realized.
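One way to picture the count-based escalation is a counter that is reset at the start of each time period. The sketch below assumes a fixed, non-overlapping period and timestamps in seconds that never decrease; the text does not specify how the period or the count is tracked, so the structure and names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct kill_counter {
    uint64_t window_start;   /* start of the current period, in seconds */
    uint64_t window_length;  /* the "preset time period" */
    unsigned int count;      /* abnormal processes ended in this period */
    unsigned int limit;      /* the "preset value" */
};

/* Record that one abnormal process was ended at time `now` and report
 * whether the system should now be restarted. */
bool record_kill_and_check(struct kill_counter *kc, uint64_t now)
{
    if (now - kc->window_start >= kc->window_length) {
        kc->window_start = now;   /* begin a fresh period */
        kc->count = 0;
    }
    kc->count++;
    return kc->count > kc->limit;
}
```

With a limit of 2 and a 60-second period, the third ended process inside one period requests a restart, while an ended process arriving after the period has elapsed starts counting again from one.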

In a second aspect, an embodiment of the present application provides an exception handling apparatus, configured to execute the method in the first aspect or any possible implementation manner of the first aspect, and specifically, the apparatus includes a module configured to execute the method in the first aspect or any possible implementation manner of the first aspect.

In a third aspect, an embodiment of the present application provides an exception handling apparatus, including: one or more processors; a memory for storing one or more programs; when executed by the one or more processors, cause the one or more processors to implement a method as in the first aspect above or any possible implementation manner of the first aspect.

In a fourth aspect, the present application provides a computer-readable medium for storing a computer program including instructions for executing the method of the first aspect or any possible implementation manner of the first aspect.

In a fifth aspect, the present application further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method of the first aspect or any possible implementation manner of the first aspect.

Drawings

FIG. 1 is a schematic block diagram of a computer system according to an embodiment of the present application;

FIG. 2 is an example of a task structure provided by an embodiment of the present application;

FIG. 3 is a schematic flow chart diagram of a method for exception handling provided by an embodiment of the present application;

FIG. 4 is a specific example of exception recovery provided by an embodiment of the present application;

FIG. 5 is another specific example of exception recovery provided by an embodiment of the present application;

FIG. 6 is another specific example of exception recovery provided by an embodiment of the present application;

FIG. 7 is a schematic block diagram of an exception handling apparatus provided in an embodiment of the present application;

FIG. 8 is a schematic block diagram of another exception handling apparatus provided in an embodiment of the present application.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

Fig. 1 is a schematic structural diagram of a computer system according to an embodiment of the present application. The computer system may be a terminal device, such as a smart phone, a personal computer, a tablet computer, a vehicle-mounted device, an intelligent appliance, an artificial intelligence device, or a system device, such as a server, which is not limited in this embodiment of the present application. As shown in fig. 1, the computer system includes a communication module 110, a sensor 120, a user input module 130, an output module 140, a processor 150, an audio-visual input module 160, a memory 170, and a power supply 180.

The communication module 110 may include at least one module that enables communication between the computer system and other computer systems. For example, the communication module 110 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless internet module, a local area communication module, and a location (or position) information module, etc. The various modules are implemented in various ways in the prior art, and are not described in the application.

The sensor 120 may sense a current state of the system, such as an open/close state, a position, whether there is contact with a user, a direction, and acceleration/deceleration, and the sensor 120 may generate a sensing signal for controlling the operation of the system.

The user input module 130 is configured to receive input digital information, character information, or contact touch operation/non-contact gesture, and to receive signal input related to user setting and function control of the system. The user input module 130 includes a touch panel and/or other input devices.

The output module 140 includes a display panel for displaying information input by a user, information provided to the user, various menu interfaces of a system, and the like. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. In other embodiments, the touch panel can be overlaid on the display panel to form a touch display screen. In addition, the output module 140 may further include an audio output module, an alarm, a haptic module, and the like.

The audio/video input module 160 is used to input audio signals or video signals and may include a camera and a microphone.

The power supply 180 may receive external power and internal power under the control of the processor 150 and provide power required for the operation of the various components of the system.

The processor 150 may represent one or more processors. For example, the processor 150 may include one or more central processors, or a central processor and a graphics processor, or an application processor and a co-processor (e.g., a micro-control unit or a neural network processor). When the processor 150 includes multiple processors, the multiple processors may be integrated on the same chip or may be separate chips. A processor may include one or more physical cores, where a physical core is the smallest processing module.

The memory 170 stores computer programs including an operating system program 172, an application program 171, and the like. Typical operating systems include those for desktop or notebook computers, such as Windows from Microsoft Corporation and macOS from Apple Inc., and those for mobile terminals, such as the Android system developed by Google Inc.

The memory 170 may be one or more of the following types: flash (flash) memory, hard disk type memory, micro multimedia card type memory, card type memory (e.g., SD or XD memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In other embodiments, the memory 170 may be a network storage device on the internet, and the system may perform an update or read operation on the memory 170 on the internet.

The processor 150 is configured to read the computer program in the memory 170 and then execute a method defined by the computer program, such as the processor 150 reading the operating system program 172 to run an operating system on the system and implement various functions of the operating system, or reading the one or more application programs 171 to run an application on the system.

The memory 170 further stores other data 173 besides computer programs, which is not limited in the embodiment of the present application.

The connection relationship of each module in fig. 1 is only an example, and the method provided in any embodiment of the present application may also be applied to systems with other connection manners, for example, all modules are connected through a bus, which is not limited in this embodiment of the present application.

The following describes embodiments disclosed in the present application by taking the operating system of the computer system in fig. 1 as a Linux system. It should be noted that the embodiments disclosed in the following description of the present application are not limited to the Linux system, and other systems may also perform exception handling by using the method disclosed in the embodiments of the present application.

In the Linux system, virtual memory is divided into a kernel space and a user space. The kernel space is the space in which the kernel program of the system, that is, the kernel code, runs. The user space is the space in which the user programs of the system, that is, user program code, run. The kernel space and the user space are isolated from each other: the kernel process has exclusive use of the kernel space, and a user-space process cannot access the kernel space directly but must enter the kernel through a system interface call to perform read and write operations.

More than half of the faults in the Linux kernel affect only the faulting process itself: they do not pollute the public resources of the kernel space and do not affect the operation of other processes in the system. For crash faults of this type, it is therefore sufficient to kill the process and restart it. Based on this, in the embodiment of the application, when the Linux kernel encounters a crash fault, the system is not restarted indiscriminately; instead, hierarchical recovery processing is performed.

When the process in which the crash fault occurs does not hold a lock protecting a public resource (i.e., a shared resource) of the kernel, the direct exit of the process does not prevent other processes from continuing to use the public resources in the kernel. However, if the process holds a lock protecting a public resource of the kernel, its direct exit leaves that lock unreleased, and other processes can no longer use the public resource, resulting in abnormal system functions. Therefore, in the embodiment of the present application, when a process holding a lock for protecting a common resource of the kernel exits, the lock it holds should be released.

As an implementation manner, whether the resource protected by the lock is the common resource of the kernel may be determined according to whether the address of the resource protected by the lock held by the process is located in an address interval corresponding to the common resource of the kernel. When the address of the resource protected by the lock held by the process is located in the address interval corresponding to the public resource of the kernel, it can be determined that the resource protected by the lock is the public resource of the kernel, that is, the process holds the lock for protecting the public resource of the kernel at this time. When the address of the resource protected by the lock held by the process is not located in the address interval corresponding to the common resource of the kernel, it may be determined that the resource protected by the lock is not the common resource of the kernel, that is, the process does not hold the lock for protecting the common resource of the kernel at this time.

To determine whether a resource protected by a lock held by a process is a common resource of the kernel, an association between the lock held by the process and its protected resource needs to be established. Based on this, the embodiment of the application modifies the existing lock mechanism of the kernel, and establishes the association relationship between the lock held by the process and the protected resource.

By way of example, in the embodiment of the present application, a new member "protected resource" may be added to the original data structure of a "lock", so that an association between the "lock" and the "protected resource" is established. As a specific example, the modified data structure of the lock is as follows:

struct xxxlock {
    …;  /* original structure definition */
    void *shared_resource;
};

Here, shared_resource in the data structure is the memory address of the resource protected by the lock xxxlock.

Further, in the embodiment of the application, the original interface function xxx_lock(&lock) can be modified into a new interface function xxx_lock(&lock, &shared_resource); within the function implementation, shared_resource is assigned to the shared_resource member of lock.
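The modified interface can be modelled in user-space C as below. This is only a sketch under stated assumptions: a C11 atomic_flag spin loop stands in for whatever the original lock does, and example_lock, example_lock_init, example_lock_acquire and example_lock_release are hypothetical names rather than kernel functions.

```c
#include <stdatomic.h>
#include <stddef.h>

struct example_lock {
    atomic_flag locked;      /* stands in for the original lock state */
    void *shared_resource;   /* address of the resource the lock protects */
};

void example_lock_init(struct example_lock *lock)
{
    atomic_flag_clear(&lock->locked);
    lock->shared_resource = NULL;
}

/* New-style acquire: takes the lock and records what it protects. */
void example_lock_acquire(struct example_lock *lock, void *shared_resource)
{
    while (atomic_flag_test_and_set(&lock->locked))
        ;   /* spin until the lock becomes free */
    lock->shared_resource = shared_resource;
}

void example_lock_release(struct example_lock *lock)
{
    atomic_flag_clear(&lock->locked);
}
```

A caller would pass the address of the protected object, for example example_lock_acquire(&lock, &counter), so that the lock can later be classified by where that object lives.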

In some embodiments, a lock holding linked list corresponding to the process may also be established, where the lock holding linked list records at least one lock held by the process for protecting a common resource of the kernel. Illustratively, the lock holding linked list includes an identifier of at least one lock held by the process for protecting a common resource of the kernel. The lock holding linked list corresponding to the process can therefore be used to manage the locks that the process holds for protecting the public resources of the kernel.

As a specific example, the definition of the data structure used to manage a process (e.g., task_struct) may be modified, for example by adding a lock holding linked list to the existing task_struct. Fig. 2 shows an example of a task structure provided by an embodiment of the present application. As shown in fig. 2, a lock list (lock_list) is newly added to the task structure (i.e., task_struct), and n different locks, such as lock 1, lock 2 through lock n, are recorded in the lock list, where the n different locks are respectively used to protect different common resources of the kernel. For example, the n locks may be mutual exclusion locks, read-write locks, and the like, which is not limited in this embodiment of the application. It will be appreciated that the lock list is one specific example of the lock holding linked list described above.

It should be noted that fig. 2 only shows an example of one task structure provided by the embodiment of the present application, but the embodiment of the present application is not limited thereto. For example, the number of lock lists in the task structure may also be more than one, such as two, or others.
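A task structure carrying such a lock list might look like the following user-space sketch. The field and function names are hypothetical, and a singly linked list with a tail pointer is used only for brevity; a kernel implementation would more likely reuse the kernel's own list primitives.

```c
#include <stddef.h>

struct held_lock {
    int lock_id;               /* hypothetical identifier of the lock */
    struct held_lock *next;
};

struct example_task {
    /* ... other process-management fields would live here ... */
    struct held_lock *lock_list_head;
    struct held_lock *lock_list_tail;
};

/* Append a lock record at the tail of the task's lock list. */
void lock_list_append(struct example_task *task, struct held_lock *node)
{
    node->next = NULL;
    if (task->lock_list_tail)
        task->lock_list_tail->next = node;
    else
        task->lock_list_head = node;
    task->lock_list_tail = node;
}
```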

As an implementation manner, when a process enters the kernel through a system interface call and holds a lock, if it is determined that the resource protected by the lock held by the process is a public resource of the kernel, an identifier of the lock may be added to the tail of the lock holding linked list corresponding to the process.

As an example, whether the resource protected by the lock is a common resource of the kernel may be determined according to whether the resource address represented by shared_resource in the lock held by the process is located in the uninitialized data segment (.bss segment) or in the initialized data segment (.data segment). Here, the .bss segment is [_bss_start, _bss_end] and the .data segment is [_sdata, _edata].

When the resource address represented by shared_resource is located in [_bss_start, _bss_end] or in [_sdata, _edata], it is determined that the resource protected by the lock corresponding to shared_resource is a common resource of the kernel, and the lock may then be added to the tail of the lock holding linked list corresponding to the task_struct of the process.

When the resource address represented by shared_resource is located neither in [_bss_start, _bss_end] nor in [_sdata, _edata], it is determined that the resource protected by the lock corresponding to shared_resource is not a common resource of the kernel, and the lock is not added to the lock holding linked list corresponding to the task_struct of the process.
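The classification and conditional recording described above can be sketched as follows. The segment bounds are passed in as plain values so that the example runs anywhere, a fixed-size array stands in for the lock holding linked list, and the bounds are treated as inclusive to follow the bracket notation; all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HELD_LOCKS 8   /* arbitrary capacity for this sketch */

struct segment_range { uintptr_t start, end; };   /* inclusive bounds */

struct example_lock { void *shared_resource; };

struct example_task {
    struct example_lock *held[MAX_HELD_LOCKS];  /* stands in for the list */
    size_t held_count;
};

bool in_range(uintptr_t addr, struct segment_range r)
{
    return addr >= r.start && addr <= r.end;
}

/* Record the lock only if the resource it protects lies in the .bss or
 * .data range. Returns true if the lock was recorded. */
bool track_lock_if_common(struct example_task *task,
                          struct example_lock *lock,
                          struct segment_range bss,
                          struct segment_range data)
{
    uintptr_t addr = (uintptr_t)lock->shared_resource;

    if (!in_range(addr, bss) && !in_range(addr, data))
        return false;                       /* not a kernel common resource */
    if (task->held_count == MAX_HELD_LOCKS)
        return false;                       /* sketch only: no room left */
    task->held[task->held_count++] = lock;  /* append at the tail */
    return true;
}
```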

In the embodiment of the application, when the kernel crashes, hierarchical recovery processing can be performed according to the address space in which the fault occurred and the type of operation that the abnormal process performed on the shared resource. Here, the type of operation of the abnormal process on the shared resource is, for example, a read operation or a write operation.

Fig. 3 shows a schematic flow chart of a method 300 for exception handling provided by an embodiment of the present application. For example, the method 300 may be performed by the computer system shown in fig. 1, or by a unit or module (e.g., a processor) included in the computer system. Method 300 includes step 310 and step 320.

Step 310: when an access exception occurs in system memory, determine the memory address where the access exception occurs.

Illustratively, when a system memory access is abnormal, the kernel calls a memory access exception interface function, such as static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs). The second parameter "addr" of the interface function indicates the address where the exception occurred. The memory address where the memory access exception occurs can therefore be determined by obtaining the parameter "addr".

Step 320: when the memory address with the access exception is in the user space, or when the memory address with the access exception is in the kernel space and the access operation is a read operation, release at least one lock that the process with the access exception holds for protecting the public resource of the kernel, and end the process with the access exception.

In other words, when the memory address where the access exception occurs is in the user space, or is in the kernel space and the access operation is a read operation, the memory access exception is recoverable. In this case, the process with the access exception can be killed and the kernel handles the exception fault-tolerantly, that is, the kernel continues to run. However, unlike the prior art, in the embodiment of the present application the lock that the process with the access exception holds for protecting a common resource of the kernel must also be released when the process is killed.

For example, the process with the access exception may be ended after the at least one lock it holds for protecting a common resource of the kernel has been released. Here, ending the process with the access exception means killing or exiting that process. As one implementation, the function do_exit may be called to exit the process with the access exception.
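The release-then-end ordering can be sketched in user-space C as follows. A boolean field stands in for the real unlock call and a flag stands in for calling do_exit; the structures and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct example_lock {
    bool locked;
    struct example_lock *next;   /* next entry in the holder's lock list */
};

struct example_task {
    struct example_lock *lock_list;   /* head of the lock holding list */
    bool exited;
};

/* Release every recorded lock, then mark the process as ended.
 * Returns the number of locks released. */
size_t release_locks_and_end(struct example_task *task)
{
    size_t released = 0;

    while (task->lock_list) {
        struct example_lock *lock = task->lock_list;

        task->lock_list = lock->next;   /* unlink from the list */
        lock->next = NULL;
        lock->locked = false;           /* stands in for the real unlock */
        released++;
    }
    task->exited = true;                /* stands in for calling do_exit */
    return released;
}
```

Emptying the list before the process is marked as ended reflects the ordering in the text: no lock on a kernel common resource remains held once the process is gone.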

As a possible implementation, ending the process with the access exception may be implemented by setting the kernel variable panic_on_oops to 0, for example by setting panic_on_oops to 0 when the kernel image is built, or by setting the value of /proc/sys/kernel/panic_on_oops to 0. Thus, when the memory address with the access exception is in the user space, or is in the kernel space and the access operation is a read operation, the system does not immediately reset and restart (for example, by triggering a panic to reset the system). Instead, it releases at least one lock that the process with the access exception holds for protecting the public resource of the kernel and kills that process, so that the kernel handles the memory access exception fault-tolerantly, that is, the kernel continues to run.
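For illustration, the current setting can be inspected from user space by reading the file named above. The sketch below assumes the file holds a single decimal digit followed by a newline; the two function names are hypothetical helpers, not part of any kernel or library interface.

```c
#include <stdio.h>

/* Parse the text form of the setting ("0\n" or "1\n").
 * Returns 0 or 1, or -1 if the text is not recognised. */
int parse_panic_on_oops(const char *text)
{
    int value;

    if (sscanf(text, "%d", &value) != 1 || (value != 0 && value != 1))
        return -1;
    return value;
}

/* Read the current setting; returns -1 if the file cannot be read. */
int read_panic_on_oops(void)
{
    char buf[16];
    int result = -1;
    FILE *f = fopen("/proc/sys/kernel/panic_on_oops", "r");

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        result = parse_panic_on_oops(buf);
    fclose(f);
    return result;
}
```

A result of 0 corresponds to the fault-tolerant behaviour described here, and 1 to an immediate panic and reset.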

It should be noted that, in the embodiment of the present application, releasing the lock for protecting the common resource of the kernel, that is, releasing the common resource of the kernel protected by the lock, both represent the same meaning.

In some alternative embodiments, if the kernel cannot continue to run during or after the execution of step 320, that is, after the at least one lock for protecting the common resource of the kernel held by the process with the access exception has been released or that process has been ended, a reset and restart may be performed, for example, by triggering a panic to reset the system.

As a possible implementation manner, triggering a panic to reset the system can be realized by setting panic_on_oops to 1, for example, by setting panic_on_oops to 1 when the kernel image is built, or by setting the value of /proc/sys/kernel/panic_on_oops to 1, so that the kernel no longer performs fault-tolerant processing but performs a reset and restart.

In step 320, the memory address where the access exception occurs may be obtained through the memory access exception interface, and it is determined whether the memory address where the access exception occurs is in the kernel space or the user space according to the value of the memory address.

Illustratively, static void __do_kernel_fault is an example of a memory access exception interface. As a specific example, the memory address where the access exception occurs may be represented by "addr" in static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs). According to the value of addr, whether the address where the exception occurs is located in the kernel space or the user space can be judged. As an example, when the value of "addr" is smaller than TASK_SIZE, the memory address where the access exception occurs is located in the user space. Conversely, when the value of "addr" is greater than or equal to TASK_SIZE, the memory address where the access exception occurs is located in the kernel space.
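The comparison against TASK_SIZE described above can be sketched in ordinary user-space C as follows. This is only an illustrative sketch, not kernel code: DEMO_TASK_SIZE is a hypothetical stand-in for TASK_SIZE (whose real value depends on the architecture and kernel configuration), and the names classify_fault_addr and enum addr_space are assumptions introduced for this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's TASK_SIZE; the real value is
 * architecture- and configuration-dependent. */
#define DEMO_TASK_SIZE UINT64_C(0x0000800000000000)

enum addr_space { ADDR_USER_SPACE, ADDR_KERNEL_SPACE };

/* Classify a faulting address: below the boundary is user space,
 * at or above the boundary is kernel space. */
enum addr_space classify_fault_addr(uint64_t addr)
{
    return addr < DEMO_TASK_SIZE ? ADDR_USER_SPACE : ADDR_KERNEL_SPACE;
}
```

With this stand-in boundary, an address such as 0x1000 would be classified as user space, while the boundary value itself would be classified as kernel space.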

In step 320, the operation type of the access exception may be determined by the memory access exception interface. For example, whether the abnormal access is a write operation or a read operation may be determined according to the value of the sixth bit of the third parameter esr in static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs). As an example, when the value of the sixth bit of esr is "0", the abnormal access is a read operation, and when the value of the sixth bit of esr is "1", the abnormal access is a write operation.
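The read/write test on esr can be sketched as a single bit test. The sketch below assumes that "the sixth bit" means bit index 6 (mask 0x40), which is where the write-not-read flag sits in the arm64 syndrome register for data aborts; that reading, and the names DEMO_ESR_WRITE_BIT and fault_is_write, are assumptions made for illustration.

```c
#include <stdbool.h>

/* Assumed position of the read/write flag in esr: bit index 6. */
#define DEMO_ESR_WRITE_BIT (1U << 6)

/* Returns true when the faulting access was a write, false for a read. */
bool fault_is_write(unsigned int esr)
{
    return (esr & DEMO_ESR_WRITE_BIT) != 0;
}
```

Under this assumption, an esr value whose low byte is 0x45 would be treated as a write, and one whose low byte is 0x05 as a read.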

In some optional embodiments, releasing the lock that is held by the process and used for protecting the common resource of the kernel may be implemented by releasing the locks recorded on the lock holding linked list corresponding to the process with the access exception. The locks recorded on the lock holding linked list include the at least one lock held by the process and used for protecting the common resource of the kernel.

Illustratively, the locks on the lock holding linked list in the task_struct corresponding to the process with the access exception can be released in the function static void __do_kernel_fault, so as to release the at least one lock that is held by the process and used for protecting the common resource of the kernel.
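Releasing every lock recorded on a per-process list can be sketched as a simple list walk. The following is a user-space illustration under stated assumptions: struct demo_lock models a lock as a boolean flag, struct demo_task stands in for the task structure, and all names are hypothetical rather than actual kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* A lock is modelled as a flag; a real kernel lock would be unlocked
 * through its own unlock primitive. */
struct demo_lock {
    bool locked;
};

struct held_lock_node {
    struct demo_lock *lock;
    struct held_lock_node *next;
};

/* Hypothetical stand-in for the task structure of a process. */
struct demo_task {
    struct held_lock_node *held_head;
};

/* Release every lock recorded on the task's lock holding list and
 * detach the list. Returns the number of locks actually released.
 * The nodes are owned by the caller and are not freed here. */
size_t release_held_locks(struct demo_task *task)
{
    size_t released = 0;

    for (struct held_lock_node *n = task->held_head; n != NULL; n = n->next) {
        if (n->lock->locked) {
            n->lock->locked = false;
            released++;
        }
    }
    task->held_head = NULL;
    return released;
}
```

Because only locks that protect common resources are recorded on the list, the walk does not need to inspect what each lock protects.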

Before step 310, i.e., during normal operation of the process, when the process enters the kernel and holds a lock, it may be determined whether the resource protected by the lock is a common resource. When the resource protected by the lock is a common resource, the lock is added to the tail of the lock holding linked list corresponding to the process. When the resource protected by the lock is not a common resource, the lock does not need to be added to the lock holding linked list corresponding to the process.
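The bookkeeping at lock acquisition can be sketched as a conditional append to the tail of a singly linked list. This is a minimal illustration: the structure layout, the caller-provided entry storage, and the names record_lock, struct lock_entry and struct lock_list are assumptions, and the decision whether the resource is common is passed in as a flag.

```c
#include <stdbool.h>
#include <stddef.h>

struct lock_entry {
    const void *lock;            /* the lock that was acquired */
    struct lock_entry *next;
};

struct lock_list {
    struct lock_entry *head;
    struct lock_entry *tail;
};

/* Record an acquired lock at the tail of the list, but only when it
 * protects a common resource. Returns true if the lock was recorded. */
bool record_lock(struct lock_list *list, struct lock_entry *entry,
                 const void *lock, bool protects_common_resource)
{
    if (!protects_common_resource)
        return false;            /* private resource: nothing to track */

    entry->lock = lock;
    entry->next = NULL;
    if (list->tail != NULL)
        list->tail->next = entry;
    else
        list->head = entry;
    list->tail = entry;
    return true;
}
```

Keeping a tail pointer makes the append constant-time, so the extra cost on the normal lock path stays small.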

For example, for the process of adding a lock used for protecting a common resource of the kernel to the lock holding linked list, reference may be made to the related description above; for brevity, details are not described here again.

As a possible implementation, whether the resource protected by the lock is a common resource of the kernel can be determined according to whether the memory address of the resource protected by the lock is located in the bss section or the data section. When the memory address of the resource protected by the lock is located in the bss section or the data section, it is determined that the resource protected by the lock is a common resource of the kernel.
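The section test amounts to two range checks. The sketch below takes the section bounds as explicit parameters so that it can run in user space; in a real kernel the bounds would come from linker-provided section symbols. The type struct section_range and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct section_range {
    uintptr_t start;
    uintptr_t end;
};

bool in_range(const struct section_range *r, uintptr_t addr)
{
    return addr >= r->start && addr < r->end;
}

/* A resource is treated as a common resource of the kernel when its
 * address falls in the bss section or in the data section. */
bool is_kernel_common_resource(uintptr_t resource_addr,
                               const struct section_range *bss,
                               const struct section_range *data)
{
    return in_range(bss, resource_addr) || in_range(data, resource_addr);
}
```

The rationale is that statically allocated global variables live in these two sections, whereas resources on a stack or in dynamically allocated memory fall outside both ranges.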

For example, for determining whether the resource protected by the lock is a common resource of the kernel, reference may be made to the description above; for brevity, details are not described here again.

In some embodiments, when the memory address where the access exception occurs is in the kernel space and the access operation is a write operation, the system is restarted, that is, the system is reset and restarted, for example, by triggering a panic to reset the system.

In other words, when the memory address where the access exception occurs is in the kernel space and the access operation is a write operation, the access exception of the memory is an unrecoverable memory access exception, and at this time, the system can be reset and restarted without performing fault-tolerant processing on the kernel.

For example, when the value of "addr" in static void __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int esr, struct pt_regs *regs) is greater than or equal to TASK_SIZE, it may be determined that the memory address where the exception occurs is located in the kernel space. If the access operation is also a write operation, the system is reset and restarted, for example, by triggering a panic to reset the system directly. As a possible implementation manner, the kernel variable panic_on_oops may be set to 1 at this time, for example, the value of the file /proc/sys/kernel/panic_on_oops may be set to 1, so that the kernel directly triggers a panic to reset the system.

Therefore, in the embodiment of the application, when the memory access exception occurs in the user space, or when the exception occurs in the kernel space and the exception access is a read operation, the fault-tolerant processing of the kernel is realized, and when the address space where the exception occurs is in the kernel space and the exception access is a write operation, the system is restarted, so that the memory access exception in the kernel is subjected to the hierarchical recovery processing, thereby being beneficial to avoiding the complete machine reset and the system availability reduction caused by the recoverable memory access exception, and improving the system reliability.
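The hierarchical decision summarized above can be written as a small pure function. This is a sketch of the decision table only: the enum values and the function name decide_fault_action are hypothetical, and the two inputs are assumed to have been derived beforehand from the fault address and from esr.

```c
#include <stdbool.h>

enum fault_action {
    ACTION_RELEASE_LOCKS_AND_END_PROCESS,  /* recoverable exception */
    ACTION_RESET_SYSTEM                    /* unrecoverable exception */
};

/* Only a write to a kernel-space address is treated as unrecoverable;
 * any user-space fault and any kernel-space read fault are recoverable. */
enum fault_action decide_fault_action(bool addr_in_kernel_space, bool is_write)
{
    if (addr_in_kernel_space && is_write)
        return ACTION_RESET_SYSTEM;
    return ACTION_RELEASE_LOCKS_AND_END_PROCESS;
}
```

Of the four input combinations, three lead to fault-tolerant processing and one leads to a reset, which is what makes the recovery hierarchical rather than all-or-nothing.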

In some optional embodiments, when, within a preset time period, the number of times that a lock for protecting the common resource of the kernel has been released and a process with an access exception has been ended is greater than a preset value, the system is restarted, that is, the system is reset and restarted, for example, by triggering a panic to reset the system.

If the operation of releasing the at least one lock that is held by the process with the access exception for protecting the common resource of the kernel, and ending that process, is referred to as a first operation, this may also be described as follows: within a preset time period, if the number of times the first operation is executed is greater than a preset value, the system is restarted, that is, the system is reset and restarted, for example, by triggering a panic to reset the system.

That is, when the number of memory access exceptions of the system within a specified time (for example, 12 hours or 24 hours) reaches a preset number (for example, a memory access exception limit configured in advance by the system), it indicates that fault-tolerant processing of the kernel cannot keep the system running normally. At this time, the system may be restarted, that is, reset and restarted, so as to improve the reliability of the system. Specifically, for the manner of restarting the system, reference may be made to the description of the steps above; for brevity, details are not described here again.
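One way to count recoveries within a time period is a fixed-window counter, sketched below. The text does not specify how the window is tracked, so the fixed-window policy, the use of seconds as the time unit, and the names struct recovery_limiter and limiter_record_recovery are all assumptions. The sketch uses the "greater than a preset value" comparison.

```c
#include <stdbool.h>
#include <stdint.h>

struct recovery_limiter {
    uint64_t window_seconds;     /* length of the preset time period */
    unsigned int max_recoveries; /* preset value */
    uint64_t window_start;       /* start time of the current window */
    unsigned int count;          /* recoveries seen in the current window */
};

/* Record one recovery (locks released, process ended) at time `now`.
 * Returns true when the count within the current window exceeds the
 * preset value, meaning the system should be reset instead. */
bool limiter_record_recovery(struct recovery_limiter *l, uint64_t now)
{
    if (now - l->window_start >= l->window_seconds) {
        l->window_start = now;   /* previous window expired: start anew */
        l->count = 0;
    }
    l->count++;
    return l->count > l->max_recoveries;
}
```

A sliding window would track the period more precisely at the cost of storing individual timestamps; the fixed window is chosen here only for brevity.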

Three specific examples of the exception recovery provided by the embodiment of the present application are described below with reference to fig. 4 to 6. It should be understood that these examples are only for assisting the skilled person in understanding the aspects of the present application, and do not constitute any limitation to the embodiments of the present application.

In fig. 4, the address where the memory abnormal access occurs is in the user space, and at this time, the process of the abnormal access may be ended after releasing at least one lock for protecting the common resource of the kernel, which is held by the process of the abnormal access. Referring to fig. 4, the method includes steps 401 to 407.

401, a process has a memory access exception at kernel module X (KernelModule_X).

For example, the kernel module X may be a driver module, a memory management module, a file system management module, a network management module, and the like, which is not limited in this embodiment of the present application.

402, the kernel anomaly detection module detects a memory access anomaly.

403, the kernel exception detecting module calls the fault handling interface, that is, the exception is handed to the kernel fault handling module for handling.

404, the kernel fault handling module calls a corresponding exception handling interface according to the exception type. The exception handling interface may be registered in advance.

Specifically, step 401 to step 404 may refer to the prior art, and are not described herein again.

405, the kernel fault handling module calls a memory access exception interface.

Here, the memory access exception interface is, for example, static void __do_kernel_fault(). Specifically, for static void __do_kernel_fault(), reference may be made to the description in fig. 3; for brevity, the description is not repeated here.

406, it is determined that the memory access exception occurs in the user space, and the common resource is released.

Specifically, when it is determined in the memory access exception interface that the memory access exception occurs in the user space, the common resource is released. As an example, the at least one lock that is held by the exception process and used for protecting the common resource of the kernel may be released. Specifically, for step 406, reference may be made to the related description in fig. 3; for brevity, details are not described here again.

407, calling the process exit interface.

For example, in the case that the execution of step 406 is completed successfully, step 407 is executed to let the exception process exit by itself.

Therefore, in the embodiment of the application, when an exception occurs in the user space, that is, when a recoverable memory access exception occurs, the lock held by the exception process for protecting the common resource of the kernel is released, so that the exit of the exception process does not leave the common resource unreleased. Other processes can therefore continue to use the common resource of the kernel, and the kernel can perform fault-tolerant processing after the exception process exits. In this way, a reset of the whole machine and a decrease in system availability caused by a recoverable memory access exception can be avoided, and the system reliability is improved.

Therefore, in the embodiment of the application, when the memory access abnormality occurs in the user space, before the process with the access abnormality is killed, the lock for protecting the public resource of the kernel, which is held by the abnormal process, is released, so that the exception of the abnormal process does not cause the non-release of the public resource, and other processes can continue to use the public resource of the kernel, thereby realizing the fault-tolerant processing of the kernel, being beneficial to avoiding the complete machine reset and the system availability reduction caused by the recoverable memory access abnormality, and improving the system reliability.

In fig. 5, the address where the memory abnormal access occurs is in the kernel space, and the abnormal access is a read operation, at this time, the process of the abnormal access may be ended after at least one lock for protecting the public resource of the kernel, which is held by the process of the abnormal access, is released. Referring to fig. 5, the method includes steps 501 to 507.

501, a process has a memory access exception at kernel module X (KernelModule_X).

502, the kernel anomaly detection module detects a memory access anomaly.

503, the kernel exception detecting module calls the fault handling interface, that is, the exception is handed to the kernel fault handling module for handling.

504, the kernel fault handling module calls a corresponding exception handling interface according to the exception type. The exception handling interface may be registered in advance.

505, the kernel fault handling module calls a memory access exception interface.

Specifically, step 501 to step 505 can refer to descriptions in step 401 to step 405 in fig. 4, and are not described herein again for brevity.

506, it is determined that the memory access exception occurs in the kernel space and the abnormal access is a read operation, and the common resource is released.

Specifically, when it is determined in the memory access exception interface that the memory access exception occurs in the kernel space and the abnormal access is a read operation, the common resource is released. As an example, the at least one lock that is held by the exception process and used for protecting the common resource of the kernel may be released. Specifically, for step 506, reference may be made to the related description in fig. 3; for brevity, details are not described here again.

507, calling the process exit interface.

For example, in the case that the execution of step 506 is completed successfully and the kernel is not restarted, step 507 is executed to allow the exception process to exit by itself.

Therefore, in the embodiment of the application, when the memory access exception occurs in the kernel space and the access is a read operation, before the process with the access exception is killed, the lock for protecting the public resource of the kernel, which is held by the exception process, is released, so that the exception process does not cause the public resource to be not released, and other processes can continue to use the public resource of the kernel, thereby realizing fault-tolerant processing of the kernel, being beneficial to avoiding complete machine reset and system availability reduction caused by recoverable memory access exception, and improving system reliability.

In fig. 6, the address where the memory abnormal access occurs is in the kernel space, and the abnormal access is a write operation, or the number of times of the abnormal access in the kernel reaches a specified value, at this time, the system can be directly reset. Referring to fig. 6, the method includes steps 601 to 606.

601, a process has a memory access exception at kernel module X (KernelModule_X).

602, the kernel anomaly detection module detects a memory access anomaly.

603, the kernel exception detecting module calls the fault handling interface, that is, the exception is handed to the kernel fault handling module for handling.

604, the kernel fault handling module calls a corresponding exception handling interface according to the exception type. Wherein, the exception handling interface may be registered in advance.

605, the kernel fault handling module calls a memory access exception interface.

Specifically, step 601 to step 605 may refer to descriptions in step 401 to step 405 in fig. 4, and for brevity, are not described again here.

606, it is determined that the memory access exception occurs in the kernel space and the abnormal access is a write operation, or that the number of exceptions reaches a specified value, and the system is reset and restarted.

Specifically, when it is determined in the memory abnormal access interface that the memory access occurs in the kernel space and the abnormal access is a write operation, or the number of times of abnormality in the kernel reaches a specified value within a preset time, the system is restarted and reset. As an example, panic may be triggered directly to perform a system reset. Specifically, step 606 can refer to the related description in fig. 3, and for brevity, will not be described here again.

Therefore, in the embodiment of the application, the system is restarted under the condition that the address space in which the exception occurs is in the kernel space and the exception access is write operation, or under the condition that the number of times of the exception in the kernel reaches the preset value within the preset time, so that the memory access exception in the kernel is subjected to the hierarchical recovery processing, the reset of the whole machine and the reduction of the system availability caused by the recoverable memory access exception are avoided, and the system reliability is improved.
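The three examples of fig. 4 to fig. 6 can be tied together in one simulated handler. The following user-space sketch is illustrative only: SIM_TASK_SIZE is a hypothetical boundary, the write flag is assumed to be bit index 6 of esr, locks are modelled as boolean flags, and struct sim_task and sim_handle_fault are names invented for this example rather than kernel interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_TASK_SIZE     UINT64_C(0x0000800000000000) /* hypothetical */
#define SIM_ESR_WRITE_BIT (1U << 6)                    /* assumed bit index */

enum sim_outcome { SIM_PROCESS_EXITED, SIM_SYSTEM_RESET };

struct sim_task {
    bool *held_common_locks[4]; /* each entry points at a "locked" flag */
    size_t held_count;
    bool exited;
};

/* Simulated handling of one memory access exception. */
enum sim_outcome sim_handle_fault(struct sim_task *task, uint64_t addr,
                                  unsigned int esr,
                                  unsigned int exceptions_in_window,
                                  unsigned int limit)
{
    bool in_kernel = addr >= SIM_TASK_SIZE;
    bool is_write = (esr & SIM_ESR_WRITE_BIT) != 0;

    /* Fig. 6: kernel-space write, or exception count reached the limit. */
    if ((in_kernel && is_write) || exceptions_in_window >= limit)
        return SIM_SYSTEM_RESET;

    /* Fig. 4 and fig. 5: release the recorded locks, then end the process. */
    for (size_t i = 0; i < task->held_count; i++)
        *task->held_common_locks[i] = false;
    task->held_count = 0;
    task->exited = true;
    return SIM_PROCESS_EXITED;
}
```

In the reset path the task is left untouched, since the whole system is about to restart and releasing individual locks would serve no purpose.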

An embodiment of the present application further provides an exception handling apparatus, please refer to fig. 7. For example, the exception handling apparatus 700 may be the computer system of fig. 1, or a unit or module (e.g., a processor) included in the computer system of fig. 1. In the embodiment of the present application, the apparatus 700 may include a determining unit 710 and a processing unit 720.

The determining unit 710 is configured to determine, when an access exception occurs to the system memory, a memory address where the access exception occurs.

The processing unit 720 is configured to, when the memory address where the access exception occurs is in the user space, or when the memory address where the access exception occurs is in the kernel space and the access operation is a read operation, release at least one lock that is held by the process with the access exception and used for protecting a common resource of the kernel, and end the process with the access exception.

In some possible implementations, the processing unit 720 is specifically configured to release the lock, recorded on the lock holding linked list corresponding to the process with the access exception, of the at least one lock used for protecting the common resource of the kernel.

In some possible implementations, the lock holding linked list is included in a task structure of the access-abnormal process, where the task structure is used to manage the access-abnormal process.

In some possible implementations, the determining unit 710 is further configured to determine, when the process enters a kernel and the process holds a first lock, whether a resource protected by the first lock is a common resource of the kernel;

the processing unit 720 is further configured to add the first lock to the tail of the lock holding linked list if the resource protected by the first lock is a common resource of the kernel.

In some possible implementations, the determining unit 710 is specifically configured to:

In some possible implementations, the data structure corresponding to the first lock includes a first member, and the first member is used to indicate a memory address of a resource protected by the first lock.

In some possible implementations, the processing unit 720 is further configured to restart the system if the kernel of the system cannot run after releasing at least one lock for protecting common resources of the kernel, which is held by the access-abnormal process, or ending the access-abnormal process.

In some possible implementations, the processing unit 720 is further configured to restart the system if the memory address where the access exception occurs is in the kernel space and the operation of the access is a write operation.

In some possible implementations, the processing unit 720 is further configured to restart the system if, within a preset time period, the number of times of releasing the at least one lock that is held by the process with the access exception for protecting the common resource of the kernel, and ending that process, is greater than a preset value.

It should be noted that, in the embodiment of the present application, the determining unit 710 and the processing unit 720 may be implemented by a processor. Fig. 8 is a schematic block diagram illustrating another exception handling apparatus 800 according to an embodiment of the present application. As shown in fig. 8, the apparatus 800 may include a processor 810 and a memory 820. The memory 820 may be used to store code executed by the processor 810, and the like.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 810 or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor. The software module may be located in a storage medium that is well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 820, and the processor 810 reads the information in the memory 820 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.

Operations or steps performed by the apparatus 700 for exception handling shown in fig. 7 or the apparatus 800 for exception handling shown in fig. 8 may refer to the related descriptions of the operations or steps in the foregoing method embodiments, and are not repeated here to avoid repetition.

Embodiments of the present application further provide a computer-readable storage medium, which includes a computer program and when the computer program runs on a computer, the computer is caused to execute the method provided by the above method embodiments.

Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the method provided by the above method embodiments.

It should be understood that the processor mentioned in the embodiments of the present invention may be a Central Processing Unit (CPU), and may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

It will also be appreciated that the memory referred to in this embodiment of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. Volatile memory can be random access memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).

It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) is integrated in the processor.

It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

It should be understood that the descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent a particular limitation to the number of devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates that the former and latter related objects are in an "or" relationship.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of exception handling, comprising:

2. The method of claim 1, wherein releasing the at least one lock that is held by the access-exception process and used for protecting the common resource of the kernel comprises:

3. The method of claim 2, wherein the lock holding linked list is included in a task structure of the access-exception process, wherein the task structure is used to manage the access-exception process.

4. The method according to claim 2 or 3, wherein before releasing the lock for protecting the common resource of the kernel, which is recorded on the lock holding linked list corresponding to the access-exception process, the method further comprises:

5. The method of claim 4, wherein determining whether the resource protected by the first lock is a common resource of a kernel comprises:

6. The method of claim 5, wherein the data structure corresponding to the first lock includes a first member, and wherein the first member is used to indicate a memory address of a resource protected by the first lock.

7. The method of any one of claims 1-6, further comprising:

8. The method of any one of claims 1-7, further comprising:

9. The method according to any one of claims 1-8, further comprising:

10. An apparatus for exception handling, comprising:

a determining unit, configured to determine, when an access exception occurs to the system memory, a memory address where the access exception occurs; and

a processing unit, configured to, when the memory address where the access exception occurs is in the user space, or when the memory address where the access exception occurs is in the kernel space and the access operation is a read operation, release at least one lock that is held by the process with the access exception and used for protecting a common resource of the kernel, and end the process with the access exception.

11. The apparatus according to claim 10, wherein the processing unit is specifically configured to:

12. The apparatus of claim 11, wherein the lock holding linked list is included in a task structure of the access-exception process, and wherein the task structure is configured to manage the access-exception process.

13. The apparatus of claim 11 or 12,

the determining unit is further configured to determine whether a resource protected by a first lock is a common resource of a kernel when the process enters the kernel and the process holds the first lock;

the processing unit is further configured to add the first lock to the tail of the lock holding linked list when the resource protected by the first lock is a common resource of the kernel.

14. The apparatus according to claim 13, wherein the determining unit is specifically configured to:

15. The apparatus of claim 14, wherein the data structure corresponding to the first lock comprises a first member, and wherein the first member is configured to indicate a memory address of a resource protected by the first lock.

16. The apparatus according to any one of claims 10 to 15,

the processing unit is further configured to restart the system if the kernel of the system cannot run after releasing the at least one lock that is held by the access-exception process for protecting the common resource of the kernel, or after ending the access-exception process.

17. The apparatus according to any one of claims 10 to 16,

the processing unit is further configured to restart the system when the memory address where the access exception occurs is in the kernel space and the access operation is a write operation.

18. The apparatus of any one of claims 10-17,

the processing unit is further configured to restart the system when, within a preset time period, the number of times of releasing the at least one lock that is held by the access-exception process for protecting the common resource of the kernel, and ending the access-exception process, is greater than a preset value.

19. A computer system, comprising:

one or more processors;

a memory;

the memory stores one or more computer programs, the one or more computer programs comprising instructions, which when executed by the one or more processors, cause the computer system to perform the method of any of claims 1-9.

20. A computer-readable storage medium comprising instructions that, when executed on a computer system, cause the computer system to perform the method of any of claims 1-9.