CN115421960A - UE memory fault recovery method, device, electronic equipment and medium - Google Patents

Publication number
CN115421960A
Authority
CN
China
Prior art keywords
memory
page
fault
shared memory
failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211193186.8A
Other languages
Chinese (zh)
Inventor
高仲于
李诗逸
刁家庆
丁辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1405 Saving, restoring, recovering or retrying at machine instruction level
    • G06F11/141 Saving, restoring, recovering or retrying at machine instruction level for bus or memory accesses

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The application discloses a UE memory fault recovery method, device, electronic equipment and medium, and relates to the technical field of computers. In the method, when a UE memory fault is triggered, the UE memory fault address is first obtained; then, if the page containing that address is a shared memory page, the processes accessing the shared memory page are looked up; finally, those processes are ended and the UE memory fault is isolated. Because the processes that trigger the UE memory fault on a shared memory page are ended and the fault is isolated, they no longer trigger the fault, that is, the shared memory page is no longer marked, and other processes obtain a non-isolated shared memory page when they access memory. The system memory fault can therefore be recovered in a short time and the system keeps running normally.

Description

UE memory fault recovery method, device, electronic equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for recovering a UE memory failure, an electronic device, and a medium.
Background
In the server system, when a process accesses a memory, a memory failure may be triggered, and the type of the triggered memory failure may be a Correctable Error (CE) type memory failure or an Uncorrectable Error (UE) type memory failure.
A common way to recover from a UE memory fault is through the Machine Check Exception (MCE) interrupt. When several processes access the same shared memory page and one of them triggers a UE fault in that page, the usual recovery scheme isolates the shared memory page. Because the page is shared, the remaining processes may still access it. When they do, they obtain the isolated page, receive a SIGBUS signal and exit. As a result, the system cannot start or run normally, and the system memory fault remains unrecovered for a long time.
Therefore, how to enhance the UE memory failure recovery is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a UE memory failure recovery method, a UE memory failure recovery device, electronic equipment and a UE memory failure recovery medium, which are used for enhancing UE memory failure recovery.
To solve the above technical problem, the present application provides a UE memory fault recovery method, including:
when a UE memory fault is triggered, acquiring the UE memory fault address of the UE memory fault;
when the page containing the UE memory fault address is a shared memory page, searching for the processes accessing the shared memory page;
ending the processes accessing the shared memory page;
and isolating the UE memory fault.
Preferably, searching for the processes accessing the shared memory page includes:
searching for the processes accessing the corresponding shared memory page when the page containing the UE fault address is a VDSO shared memory page and the UE memory fault is of the SRAR type, or when the page containing the UE fault address is a cmdline shared memory page;
and, when the page containing the UE fault address is a VDSO shared memory page and the UE memory fault is of the SRAO type, searching for the processes accessing the VDSO shared memory page once a process is detected accessing that page.
Preferably, when the UE memory fault is of the SRAO type or the SRAR type, ending the processes accessing the shared memory page includes:
ending the processes accessing the shared memory page through active downtime, that is, by deliberately taking the system down.
Preferably, after the processes accessing the shared memory page are ended through active downtime and before the UE memory fault is isolated, the method further includes:
skipping the memory data dump when the operating system dmesg log contains information indicating that the memory data dump is to be skipped.
Preferably, after skipping the memory data dump and before isolating the UE memory fault, the method further includes:
recording the register information and the error address related to the UE memory fault.
Preferably, when the page containing the UE fault address is a cmdline shared memory page, ending the processes accessing the shared memory page includes:
adding a forced SRAR flag in the operating system;
and ending, by means of the forced SRAR flag, all processes that have established a mapping to the shared memory page.
Preferably, before acquiring the UE memory fault address of the UE memory fault, the method further includes:
adding APEI bypass processing in the operating system.
To solve the above technical problem, the present application further provides a UE memory fault recovery device, including:
an acquisition module, configured to acquire the UE memory fault address of the UE memory fault when the UE memory fault is triggered;
a searching module, configured to search for the processes accessing the shared memory page when the page containing the UE memory fault address is a shared memory page;
an ending module, configured to end the processes accessing the shared memory page;
and an isolation module, configured to isolate the UE memory fault.
Preferably, the UE memory fault recovery device further includes:
a skipping module, configured to skip the memory data dump when the operating system dmesg log contains information indicating that the memory data dump is to be skipped.
Preferably, the UE memory fault recovery device further includes:
a recording module, configured to record the register information and the error address related to the UE memory fault.
Preferably, the UE memory fault recovery device further includes:
an adding module, configured to add APEI bypass processing in the operating system.
In order to solve the above technical problem, the present application further provides an electronic device, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the UE memory failure recovery method when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the method for recovering the UE memory failure are implemented.
In the UE memory fault recovery method provided by the application, when a UE memory fault is triggered, the UE memory fault address is first obtained; then, if the page containing that address is a shared memory page, the processes accessing the shared memory page are looked up; finally, those processes are ended and the UE memory fault is isolated. Because the processes that trigger the UE memory fault on a shared memory page are ended and the fault is isolated, they no longer trigger the fault, that is, the shared memory page is no longer marked, and other processes obtain a non-isolated shared memory page when they access memory. The system memory fault can therefore be recovered in a short time and the system keeps running normally.
In addition, the application also provides a device for recovering the UE memory failure, electronic equipment and a computer storage medium, which correspond to the method for recovering the UE memory failure and have the same effects.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a UE memory fault recovery method provided in the present application;
Fig. 2 is a flowchart of a common UE memory fault recovery method;
Fig. 3 is a flowchart of a UE memory fault triggering APEI and causing downtime;
Fig. 4 is a flowchart of a fault recovery method with APEI bypass added;
Fig. 5 is a structural diagram of a UE memory fault recovery device according to an embodiment of the present application;
Fig. 6 is a block diagram of an electronic device according to another embodiment of the present application;
Fig. 7 is a flowchart of memory fault recovery enhancement in a VDSO scenario according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a UE memory failure recovery method, a UE memory failure recovery device, an electronic device and a medium, which are used for enhancing the UE memory failure recovery.
For ease of understanding, the hardware involved in the technical solution of the present application is described first. The hardware that handles memory faults mainly includes a processor, a hard disk, registers and the like. A UE memory fault may be triggered when a process accesses memory. Among the hardware registers read during MCE handling, one contains the memory error address. The UE error address is parsed in order to handle the processes accessing it, and the memory fault is persisted to the file system on the hard disk and isolated, so that the system can recover from the memory fault in a short time.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. Fig. 1 is a flowchart of a UE memory failure recovery method provided in the present application. As shown in fig. 1, the method includes:
S10: When a UE memory fault is triggered, acquire the UE memory fault address of the UE memory fault.
When an application accesses memory, an Error Correction Code (ECC) check determines whether an error occurred while the memory data was read. When a UE memory fault occurs, it must be recovered so that it does not affect the system.
Fig. 2 is a flowchart of a common UE memory fault recovery method. As shown in Fig. 2, the server system is divided into a hardware layer 1 (platform layer), a firmware layer 2, an operating system layer 3 and an application layer 4. When an application accesses memory and triggers a UE fault in firmware-first mode, the hardware platform sends a System Management Interrupt (SMI) to the firmware, and the firmware sends both a Non-Maskable Interrupt (NMI) and an MCE interrupt to the operating system. The MCE interrupt is the main path for UE memory fault recovery: it is bound to do_machine_check when the operating system initializes, and after receiving the MCE interrupt the operating system calls do_machine_check to isolate and recover the UE memory fault. MCE isolation has three main steps. First, the hardware platform register information is read and the UE memory fault level is parsed. Second, the processes using the faulty memory page are looked up. Third, according to the UE fault type, a SIGBUS signal is sent to those processes to end them, and the faulty memory page is marked HWPoison. After isolation and recovery, the MCE handler records the error information in the mcelog log and updates the Error Detection And Correction (EDAC) counts. The NMI is handled by APEI/GHES, which parses the memory fault level, including CE and UE memory faults; if the UE memory fault is classified as a fatal hardware fault, the system goes down directly. When several processes access the same shared memory page and one of them triggers a UE fault in that page, this recovery scheme isolates the shared memory page. Because the page is shared, the other processes may still access it. When they do, they obtain the isolated page, receive a SIGBUS signal and exit, so the system cannot start or run normally and the system memory fault remains unrecovered for a long time. UE memory fault recovery therefore needs to be enhanced so that the system memory fault is recovered in a short time and the system runs normally.
To enhance UE memory fault recovery, the UE memory fault address must be parsed on the UE memory fault recovery path. Among the hardware registers read during MCE handling, one contains the memory error address, so the UE memory fault address can be obtained from that register. The address is parsed in the Recovery stage at the operating system layer.
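As a rough illustration of step S10, the sketch below decodes a simulated machine-check bank record and derives the page frame number of the faulty page. The VAL, UC and ADDRV bit positions follow the IA32_MCi_STATUS layout of the x86 machine-check architecture; the `McBank` type, the function name and the 4 KiB page size are assumptions made for this example and do not come from the application.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# IA32_MCi_STATUS bits (x86 machine-check architecture)
MCI_STATUS_VAL = 1 << 63    # the record is valid
MCI_STATUS_UC = 1 << 61     # the error was not corrected
MCI_STATUS_ADDRV = 1 << 58  # the address register holds a valid address


@dataclass
class McBank:
    """Hypothetical snapshot of one machine-check bank."""
    status: int
    addr: int


def ue_fault_pfn(bank: McBank) -> Optional[int]:
    """Return the page frame number of a UE fault, or None if the bank
    does not describe an uncorrected error with a valid address."""
    required = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_ADDRV
    if bank.status & required != required:
        return None
    return bank.addr >> PAGE_SHIFT
```

The page frame number is what the later steps would use to decide whether the faulty page is a shared memory page.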
S11: When the page containing the UE memory fault address is a shared memory page, search for the processes accessing the shared memory page.
After the UE memory fault address is parsed in step S10, it is determined whether the page containing that address is a shared memory page. Shared memory pages are divided into kernel-provided shared memory pages and ordinary shared memory pages. A kernel-provided shared memory page, such as a virtual dynamic shared object (VDSO) shared memory page, is provided by the kernel, and every process accesses it at startup. An ordinary memory page, such as a cmdline memory page, becomes a shared memory page when multiple processes access it at the same time. For a VDSO shared memory page, multiple applications may or may not be accessing the page at the same time; for an ordinary shared memory page, whenever one process accesses the page, other processes are accessing it as well.
When the page containing the UE memory fault address is not a shared memory page, isolating the fault address after one process triggers the UE memory fault does not affect the normal operation of other processes, so the existing UE fault recovery method is used. When the page is a shared memory page, the processes involved must be handled when a UE memory fault is triggered, so that other processes are not prevented from running normally. The processes accessing the shared memory page are looked up first.
S12: End the processes accessing the shared memory page.
The processes using the shared memory page were found in the preceding step. Because these processes are what trigger the UE memory fault, they need to be ended. How they are ended depends on the scope of the fault's impact: for example, by taking the system down, or by ending the processes that have established a mapping to the shared memory page.
S13: Isolate the UE memory fault.
After MCE handling isolates and recovers the UE memory fault, the error information is recorded in the mcelog log, and the fault address is persisted and isolated according to the log, so that an address that has already triggered a UE memory fault does not trigger it again. Once the fault address is isolated, the shared memory page itself is no longer isolated; when other processes access it, the page they obtain is not marked, so the system can start and run normally.
In the UE memory fault recovery method provided in this embodiment, when a UE memory fault is triggered, the UE memory fault address is first obtained; then, if the page containing that address is a shared memory page, the processes accessing the shared memory page are looked up; finally, those processes are ended and the UE memory fault is isolated. Because the processes that trigger the UE memory fault on a shared memory page are ended and the fault is isolated, they no longer trigger the fault, that is, the shared memory page is no longer marked, and other processes obtain a non-isolated shared memory page when they access memory. The system memory fault can therefore be recovered in a short time and the system keeps running normally.
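Steps S10 to S13 can be pictured with a small user-space model. This is only a sketch under stated assumptions: the `SystemModel` type, its fields and the function name are hypothetical, process termination is modeled by removing a PID from a set, and the existing recovery path for non-shared pages is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

PAGE_SHIFT = 12  # assumption: 4 KiB pages


@dataclass
class SystemModel:
    """Hypothetical in-memory stand-in for kernel state."""
    shared_pages: Set[int]        # page frame numbers of shared memory pages
    mappers: Dict[int, Set[int]]  # page frame number -> PIDs mapping that page
    running: Set[int]             # PIDs currently running
    isolated_addrs: Set[int] = field(default_factory=set)


def recover_ue_fault(system: SystemModel, fault_addr: int) -> List[int]:
    """Model S10 to S13 for a shared page; return the PIDs that were ended."""
    pfn = fault_addr >> PAGE_SHIFT                 # S10: locate the faulty page
    if pfn not in system.shared_pages:
        return []                                  # existing recovery path, not modeled
    pids = sorted(system.mappers.get(pfn, set()) & system.running)  # S11
    for pid in pids:                               # S12: end the accessing processes
        system.running.discard(pid)
    system.isolated_addrs.add(fault_addr)          # S13: isolate the fault address
    return pids
```

In this model a fault at an address inside a shared page ends only the processes that map that page and records the fault address as isolated; a fault in a non-shared page returns an empty list to signal that the ordinary path applies.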
In the above embodiment, when the page containing the fault address is a shared memory page, the processes accessing that page are looked up. Because different memory faults have different scopes of impact, the lookup differs for shared memory pages in different scenarios. In implementation, as a preferred mode, searching for the processes accessing the shared memory page includes:
searching for the processes accessing the corresponding shared memory page when the page containing the UE fault address is a VDSO shared memory page and the UE memory fault is of the SRAR type, or when the page containing the UE fault address is a cmdline shared memory page;
and, when the page containing the UE fault address is a VDSO shared memory page and the UE memory fault is of the SRAO type, searching for the processes accessing the VDSO shared memory page once a process is detected accessing that page.
The shared memory pages include shared memory pages in different scenarios, and the shared memory pages in common scenarios include VDSO shared memory pages and cmdline shared memory pages. And selecting a corresponding fault recovery mode according to different scenes.
In the VDSO scenario, when a VDSO shared memory page triggers a UE memory fault, the common recovery logic sends a SIGBUS signal to all processes accessing the VDSO page and ends them. The VDSO page is basic memory shared by the whole application layer. If it is marked HWPoison after memory_failure processing, every newly started process takes a page fault, obtains the HWPoison page and exits on SIGBUS. So when a memory fault occurs in the VDSO shared memory page, new processes cannot start and the system cannot run normally. To recover such a fault in a short time so that the system runs normally, the processes that trigger the fault must be handled, and the processes accessing the VDSO shared memory page are looked up first. For a VDSO shared memory page, the UE memory fault may be of the Software Recoverable Action Optional (SRAO) type or the Software Recoverable Action Required (SRAR) type. An SRAO error indicates that some data in the system is corrupted but has not been used, and the operating system may choose whether to perform a recovery action. An SRAR error indicates that data in the system is corrupted and has been used, and the operating system must perform a recovery action. Therefore, once the memory fault is determined to be in the VDSO shared memory page, it must further be determined whether it is of the SRAO or the SRAR type. For an SRAO error in the VDSO shared memory page, the recovery action may or may not be performed, which is not limited here. If no recovery action is performed, the error can be ignored at first. When a process is later detected accessing the VDSO shared memory page, the corrupted data is being used, and the memory fault must then be recovered promptly so that the system can run normally; recovery begins by looking up the processes accessing the VDSO shared memory page. For an SRAR memory fault, the data is corrupted and used, so the fault must likewise be recovered promptly, again starting by looking up the processes accessing the VDSO shared memory page.
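The SRAO/SRAR distinction can be read from the machine-check status word. The following is a simplified sketch, assuming a processor that supports software error recovery; the UC, PCC, S and AR bit positions follow the IA32_MCi_STATUS layout of the x86 machine-check architecture, while the function name and the returned labels are choices made for this example.

```python
# IA32_MCi_STATUS bits (x86 machine-check architecture)
MCI_STATUS_UC = 1 << 61   # error was not corrected
MCI_STATUS_PCC = 1 << 57  # processor context corrupted
MCI_STATUS_S = 1 << 56    # signaled through a machine-check exception
MCI_STATUS_AR = 1 << 55   # recovery action required


def classify_ue(status: int) -> str:
    """Classify a status word; a simplified view that ignores other fields."""
    if not status & MCI_STATUS_UC:
        return "not-UE"   # corrected error or no error
    if status & MCI_STATUS_PCC:
        return "fatal"    # context corrupted, not software recoverable
    if not status & MCI_STATUS_S:
        return "UCNA"     # uncorrected, no action required
    return "SRAR" if status & MCI_STATUS_AR else "SRAO"
```

With this classification, an SRAR result corresponds to corrupted data that has been consumed, and an SRAO result to corrupted data that has not yet been used.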
In the cmdline scenario, a cmdline memory page that multiple processes access together is a shared memory page. Suppose processes A, B and C access the cmdline memory page at the same time and process B triggers a UE memory fault on it. Under the common recovery logic, the operating system determines that the fault is unrelated to processes A and C, ends only process B, and isolates the cmdline memory page. Because the page is accessed by all three processes, isolating it causes processes A and C to display abnormally: their process parameters, including a specific ID, can no longer be viewed with the ps command. If a monitoring process checks the state of processes A and C by filtering ps output for cmdline keywords (such as the specific ID), it concludes that they have terminated abnormally and tries to start them again. Processes A and C still hold resources such as network ports, so the restart fails, which affects the normal operation of the system. To recover a memory fault in the cmdline shared memory page in a short time so that the system runs normally, the processes that trigger the fault must be handled, and the processes accessing the cmdline shared memory page are looked up first.
This embodiment looks up the processes accessing the corresponding shared memory page separately for UE memory faults in different scenarios. Because UE memory faults in different scenarios have different scopes of impact, handling each scenario separately allows the appropriate recovery measures to be taken.
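The scenario rules of this embodiment amount to a small decision function. The sketch below assumes string labels for the page kind and the fault type and a boolean for whether an access to the VDSO page has been detected; all three are illustrative names rather than anything defined in the application.

```python
def lookup_needed(page_kind: str, ue_type: str = "", access_detected: bool = False) -> bool:
    """Decide whether to look up the processes accessing the shared page now."""
    if page_kind == "cmdline":
        return True                 # cmdline shared page: look up immediately
    if page_kind == "vdso":
        if ue_type == "SRAR":
            return True             # corrupted data already used
        if ue_type == "SRAO":
            return access_detected  # defer until a process touches the page
    return False                    # not one of the handled shared-page scenarios
```

For example, `lookup_needed("vdso", "SRAO")` defers the lookup, while the same call with `access_detected=True` triggers it.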
In implementation, the VDSO shared memory page is provided by the kernel, and every process accesses it first at startup; if the page is marked, subsequent processes cannot start normally. A UE memory fault triggered in the VDSO shared memory page therefore has a relatively severe impact. As a preferred embodiment, when the UE memory fault is of the SRAO type or the SRAR type, ending the processes accessing the shared memory page includes:
ending the processes accessing the shared memory page through active downtime.
For an SRAR-type UE memory fault, the data is corrupted and has been used, so the processes accessing the shared memory page are ended through active downtime. For an SRAO-type UE memory fault, some data in the system is corrupted but has not been used, so the fault can be ignored at first; handling is deferred until a subsequent process reads the VDSO memory page data, which triggers an SRAR-type UE memory fault, and the fault is handled then. The handling is to end the processes accessing the VDSO shared memory page through active downtime. As noted in the above embodiment, for an SRAO-type UE memory fault the lookup of processes accessing the VDSO shared memory page starts once a process is detected accessing that page. The SRAO-type fault can therefore be handled in the same way as an SRAR-type fault, ending the processes accessing the VDSO shared memory page through active downtime.
This embodiment ends the processes accessing the shared memory page when the UE memory fault is of the SRAO or the SRAR type. Because the VDSO shared memory page is accessed by every process at startup, a UE memory fault triggered in it has a severe impact. Ending the processes accessing the shared memory page through active downtime handles the UE memory fault promptly so that the system can run normally.
After a UE memory fault triggers downtime, coredump automatically copies the memory data. During copying, the UE memory fault address is accessed again, which escalates the memory fault and triggers an Internal Error (IERR). In implementation, as a preferred embodiment, after the processes accessing the shared memory page are ended through active downtime and before the UE memory fault is isolated, the method further includes:
skipping the memory data dump when the operating system dmesg log contains information indicating that the memory data dump is to be skipped.
To prevent coredump from automatically copying memory data, that is, from collecting a memory image, after a UE memory fault triggers downtime, the operating system prints a special mark 'Forbid: kdump-Core' to its dmesg log before it calls the downtime interface, and then calls panic to go down. Downtime triggers the makedumpfile dump flow. Before the system memory data is dumped, the dmesg log is checked for the special mark 'Forbid: kdump-Core'; if the mark is present, the memory data dump is skipped.
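The marker check can be sketched in a few lines. The marker string is the one quoted in the text; the function names, the list used as a stand-in for the dmesg log and the `dump_action` callback are assumptions made for this example, not the actual dump tooling.

```python
from typing import Callable, Iterable, List

KDUMP_SKIP_MARK = "Forbid: kdump-Core"  # special mark quoted in the text


def mark_before_panic(dmesg: List[str]) -> None:
    """Write the special mark to the (simulated) dmesg log before going down."""
    dmesg.append(KDUMP_SKIP_MARK)


def should_skip_dump(dmesg_lines: Iterable[str]) -> bool:
    """Return True if any log line carries the special mark."""
    return any(KDUMP_SKIP_MARK in line for line in dmesg_lines)


def dump_memory(dmesg_lines: Iterable[str], dump_action: Callable[[], None]) -> bool:
    """Run dump_action unless the log asks to skip it; return True if dumped."""
    if should_skip_dump(dmesg_lines):
        return False
    dump_action()
    return True
```

Because the dump is skipped entirely when the mark is present, the faulty address is never read again during the dump stage in this model.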
After the active downtime finishes the process of accessing the shared memory page, and when the log has a special mark, the memory data dump is skipped, so that the problem that the memory fault is upgraded and the IEER error is triggered due to the fact that the memory data is automatically copied by the reduce after the memory fault downtime of the UE is triggered can be prevented.
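The check on the dmesg log can be sketched as follows. The mark string is the one given in the description; the function name `should_skip_memory_dump` and the idea of passing the log as a list of lines are assumptions made for illustration only.

```python
# Mark printed to the dmesg log before the downtime interface is called
# (string taken from the description above).
SKIP_DUMP_MARK = "Forbid: kdump-Core"


def should_skip_memory_dump(dmesg_lines):
    """Return True if any log line carries the skip mark (hypothetical helper)."""
    return any(SKIP_DUMP_MARK in line for line in dmesg_lines)


if __name__ == "__main__":
    log = [
        "mce: uncorrected memory error detected",  # illustrative line, not real output
        "Forbid: kdump-Core",
    ]
    print("skip dump" if should_skip_memory_dump(log) else "dump memory")
```

In this sketch the dump step would consult the helper first and proceed to copy memory only when it returns `False`.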
Building on the above embodiment, to prevent the memory fault that has occurred from affecting the system again and to make the fault easy to inspect, in implementation, after skipping the memory data dump and before isolating the UE memory fault, the method further includes:
recording the register information and the error address related to the UE memory fault.
When the MCE handler reads the hardware registers, one of them is the register holding the memory error address, so the UE memory fault address can be obtained from that register. After the memory data dump is skipped, the register information and the error address stored in the register are recorded. In implementation, a hook is added on the execution path of the downtime caused by a failed UE fault recovery. The hook redefines the coredump action, so that downtime caused by a UE memory fault no longer performs a memory data dump and only records the register information and error address related to the memory fault.
In this embodiment, after the memory data dump is skipped, only the register information and error address related to the UE memory fault are recorded. The recorded fault information makes it easy to inspect the memory fault that has occurred, so that the fault can then be isolated and normal operation of the system is ensured as far as possible.
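What such a minimal fault record might contain can be sketched as below. The application does not name the registers or bits involved; the bit positions used here follow the publicly documented IA32_MCi_STATUS layout of the x86 Machine Check Architecture and are an assumption, as are the function names and the record format.

```python
# Assumed bit positions of IA32_MCi_STATUS (x86 Machine Check Architecture).
MCI_STATUS_VAL = 1 << 63    # register contents are valid
MCI_STATUS_UC = 1 << 61     # uncorrected error
MCI_STATUS_ADDRV = 1 << 58  # address register holds a valid address
MCI_STATUS_PCC = 1 << 57    # processor context corrupted
MCI_STATUS_S = 1 << 56      # signaled via machine check
MCI_STATUS_AR = 1 << 55     # recovery action required


def classify_ue(status: int) -> str:
    """Classify a status value as SRAR, SRAO, or another category."""
    if not (status & MCI_STATUS_VAL) or not (status & MCI_STATUS_UC):
        return "not-UE"
    if status & MCI_STATUS_PCC:
        return "fatal"
    if status & MCI_STATUS_S:
        return "SRAR" if status & MCI_STATUS_AR else "SRAO"
    return "UCNA"


def make_fault_record(bank: int, status: int, addr: int) -> dict:
    """Build the small record kept instead of a full memory dump (hypothetical format)."""
    return {
        "bank": bank,
        "status": hex(status),
        "type": classify_ue(status),
        # Only trust the address when the ADDRV bit is set.
        "addr": hex(addr) if status & MCI_STATUS_ADDRV else None,
    }
```

A record of this kind is small enough to be written without touching the faulty page again, which is the point of replacing the dump with a register snapshot.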
UE memory faults in different scenarios have different scopes of impact, so the way the processes accessing the shared memory page are ended also differs between scenarios. In implementation, when the page where the UE fault address is located is a cmdline shared memory page, ending the processes accessing the shared memory page includes:
adding a forced SRAR flag in the operating system;
ending, by means of the forced SRAR flag, all processes that have established a mapping to the shared memory page.
When the memory fault lies in a VDSO shared memory page, every process needs to access that page at startup; once the VDSO shared memory page is marked with the HWPoison flag and isolated, no new process can be started afterwards. When the memory fault lies in a cmdline shared memory page, only several processes are accessing the page at the same time, and the remaining processes, which do not access it, continue to run normally; isolating the cmdline shared memory page therefore affects only the processes using it. A memory fault in the VDSO scenario is thus more serious than one in the cmdline shared memory scenario. In the above embodiment, for a memory fault in the VDSO scenario, the processes accessing the VDSO shared memory page are ended through active downtime; in this embodiment, because a memory fault in the cmdline scenario has a smaller impact, only the processes accessing the cmdline shared memory page need to be ended. Logic is added to the recovery logic of the operating system so that, when the UE memory fault falls within the cmdline range, all processes that have established a mapping to the shared memory page are forcibly ended, which prevents processes in an abnormal state from appearing in the system. In implementation, the processes accessing the shared memory page are ended by adding a forced SRAR flag in the operating system and, by means of that flag, ending all processes that have established a mapping to the shared memory page.
For a UE memory fault in the cmdline scenario, this embodiment forcibly ends all processes that have established a mapping to the shared memory page, which prevents processes in an abnormal state from appearing in the system and allows the system to run normally.
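The effect of the forced SRAR flag on process selection can be sketched as follows. The function `processes_to_end`, the mapping table, and the behaviour without the flag (ending only the triggering process) are assumptions made for illustration; the application only states that, with the flag, all processes mapping the page are ended.

```python
def processes_to_end(page_mappers, faulty_pfn, trigger_pid, force_srar):
    """Select the processes to end for a faulty cmdline shared page.

    page_mappers: dict mapping a page frame number to the set of PIDs
                  that have the page mapped (hypothetical bookkeeping).
    force_srar:   when True, every mapper is ended, as in the embodiment.
    """
    mappers = page_mappers.get(faulty_pfn, set())
    if force_srar:
        # Forced SRAR: end every process that maps the faulty page.
        return sorted(mappers)
    # Assumed baseline: only the process that hit the fault is ended.
    return [trigger_pid] if trigger_pid in mappers else []


if __name__ == "__main__":
    table = {100: {11, 12, 13}}
    for pid in processes_to_end(table, 100, 12, force_srar=True):
        print(f"would end process {pid}")
```

Processes that never mapped the page are absent from the table entry and are therefore left running, which matches the smaller impact described for the cmdline scenario.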
In the above embodiments, MCE interrupt processing is used to recover from the memory fault. In implementation, however, when an application's memory read or write triggers a UE memory fault, the underlying layer may generate an NMI interrupt and an MCE interrupt at the same time, and the APEI processing flow has no ability to recover from a UE memory fault, which directly causes the server to go down. Therefore, to prevent a UE memory fault from triggering APEI and causing downtime, as a preferred embodiment, before the UE memory fault address of the UE memory fault is obtained, the method further includes:
adding APEI bypass processing in the operating system.
When a memory read or write triggers a UE memory fault, the underlying layer generates an NMI interrupt and an MCE interrupt at the same time. The NMI interrupt corresponds to the APEI processing flow, and the MCE interrupt corresponds to the MCE recovery flow. The APEI processing flow cannot recover from a UE memory fault, so it takes the server down directly. The MCE flow has recovery capability and can recover from UE memory faults within its supported range. When the NMI interrupt and the MCE interrupt are raised at the same time, the server goes down as soon as it executes the APEI processing flow. Fig. 3 is a flowchart of a UE memory fault triggering APEI and causing downtime. As shown in Fig. 3, when an application's memory access triggers a UE fault in firmware-first mode, the hardware platform sends an SMI interrupt to the firmware, and the firmware sends an NMI interrupt and an MCE interrupt to the operating system at the same time. During NMI handling, APEI is triggered and causes downtime, so the fault recovery operations of the MCE path cannot be carried out: as shown in Fig. 3, the SIGBUS signal cannot be sent to the VM to end the process, and the Recovery operation cannot be performed.
To solve the downtime caused by APEI being processed first, APEI bypass processing logic is added in this method. When the server is recognized as supporting Machine Check Architecture (MCA), the APEI downtime flow is bypassed and the server enters the MCE UE memory fault recovery flow directly, which reduces the probability of downtime caused by UE memory faults. Fig. 4 is a flowchart of the fault recovery method with APEI bypass processing added. As shown in Fig. 4, after APEI bypass processing is added, the original APEI processing is bypassed, and the MCE interrupt processing can send the SIGBUS signal to the VM to end the process and perform fault recovery through Recovery.
In this embodiment, APEI bypass processing is added in the operating system before MCE interrupt processing is used for fault recovery. This prevents the server from going down when it executes APEI processing and thereby strengthens the system's ability to recover from memory faults.
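The dispatch decision with and without the bypass can be sketched as follows. The function name, the string labels for interrupts and outcomes, and the boolean parameters are hypothetical; they only model the behaviour described for Fig. 3 and Fig. 4.

```python
def handle_ue_interrupts(pending, mca_supported, apei_bypass_enabled):
    """Return which flow handles a UE memory fault (illustrative model).

    pending: collection of interrupt names raised for the fault,
             for example {"NMI", "MCE"}.
    """
    # The bypass only takes effect when the server supports MCA.
    bypass = apei_bypass_enabled and mca_supported
    if "NMI" in pending and not bypass:
        # APEI flow cannot recover from a UE fault: the server goes down.
        return "apei_downtime"
    if "MCE" in pending:
        # MCE flow: send SIGBUS to the affected process and recover.
        return "mce_recovery"
    return "no_action"


if __name__ == "__main__":
    both = {"NMI", "MCE"}
    print(handle_ue_interrupts(both, True, False))
    print(handle_ue_interrupts(both, True, True))
```

With both interrupts pending, the first call models Fig. 3 (APEI handled first, downtime) and the second models Fig. 4 (APEI bypassed, MCE recovery).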
In the above embodiments, a UE memory failure recovery method is described in detail, and the present application also provides embodiments corresponding to a UE memory failure recovery apparatus and an electronic device. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one is from the perspective of the function module, and the other is from the perspective of the hardware.
Fig. 5 is a structural diagram of a UE memory fault recovery apparatus according to an embodiment of the present application. This embodiment is described from the perspective of functional modules, and the apparatus includes:
an obtaining module 10, configured to obtain the UE memory fault address of a UE memory fault when the UE memory fault is triggered;
a searching module 11, configured to search for the processes accessing a shared memory page when the page where the UE memory fault address is located is a shared memory page;
an ending module 12, configured to end the processes accessing the shared memory page;
an isolation module 13, configured to isolate the UE memory fault.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
In addition, the UE memory fault recovery apparatus further includes: a skipping module, configured to skip the memory data dump when the operating system dmesg log contains information indicating that the memory data dump is to be skipped; a recording module, configured to record the register information and the error address related to the UE memory fault; and an adding module, configured to add APEI bypass processing in the operating system.
When a UE memory fault is triggered, the UE memory fault recovery apparatus provided in this embodiment first obtains the UE memory fault address through the obtaining module; then, when the page where the UE memory fault address is located is a shared memory page, it searches for the processes accessing the shared memory page through the searching module; once those processes are found, it ends them through the ending module; finally, it isolates the UE memory fault through the isolation module. In this apparatus, therefore, when the page where the UE fault address is located is a shared memory page, the processes that triggered the UE memory fault are ended and the fault is isolated, so those processes no longer trigger the UE memory fault and the shared memory page is no longer marked. Other processes can then obtain a shared memory page that has not been isolated when they access memory, which ensures that the system recovers from the memory fault in a short time and continues to run normally.
Fig. 6 is a block diagram of an electronic device according to another embodiment of the present application. This embodiment is based on a hardware perspective, and as shown in fig. 6, the electronic device includes:
a memory 20 for storing a computer program;
a processor 21, configured to implement the steps of the UE memory fault recovery method described in the above embodiments when executing the computer program.
The electronic device provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form among a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor, where the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for handling computational operations related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201 which, after being loaded and executed by the processor 21, implements the relevant steps of the UE memory fault recovery method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data related to the above UE memory fault recovery method.
In some embodiments, the electronic device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not limiting to electronic devices and may include more or fewer components than those shown.
The electronic device provided by this embodiment of the application includes a memory and a processor. When the processor executes the program stored in the memory, the UE memory fault recovery method described above can be implemented, with the same effects as described above.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is understood that, if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The computer-readable storage medium provided by the present application stores a program implementing the above UE memory fault recovery method, and its effects are the same as described above.
To help those skilled in the art better understand the technical solution of the present application, the application is described in further detail below with reference to Fig. 7, which is a flowchart of the enhanced memory fault recovery in the VDSO scenario provided by an embodiment of the present application. The method specifically includes the following steps:
S14: triggering a UE memory fault;
S15: resolving the UE error address;
S16: determining whether the address falls on a VDSO page; if not, proceeding to step S17 and then to step S21; if so, proceeding to step S18;
S17: resolving the UE error address;
S18: determining whether the fault is of the SRAO type; if not, proceeding to step S19; if so, proceeding to step S21;
S19: performing SRAR downtime handling;
S20: skipping the memory data dump;
S21: isolating the UE memory fault.
In the enhanced memory fault recovery method for the VDSO scenario provided in this embodiment, when a UE memory fault is triggered, the UE address is resolved on the UE memory fault recovery path, and it is determined whether the UE address falls on a VDSO page. For an SRAR-type error that falls on the VDSO, the system is actively taken down and no memory dump file is generated. An SRAO-type error that falls on the VDSO is ignored at first; the downtime handling for an SRAR-type UE memory fault is triggered later, when a subsequent process reads the VDSO memory page data. In this method, therefore, when the page where the UE fault address is located is a VDSO shared memory page, active downtime is performed for SRAR-type errors, so the processes that triggered the UE memory fault no longer trigger it and the shared memory page is no longer marked. Other processes can then obtain a shared memory page that has not been isolated when they access memory, which ensures that the system recovers from the memory fault in a short time and continues to run normally.
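The branching of steps S14 to S21 can be traced with a short sketch. It assumes that step S16 branches to S18 when the address falls on a VDSO page and to S17 otherwise; the function name and the string labels are hypothetical.

```python
def vdso_recovery_steps(on_vdso_page: bool, fault_type: str) -> list:
    """Return the sequence of step labels taken for one UE memory fault."""
    steps = ["S14", "S15", "S16"]  # trigger, resolve address, VDSO check
    if not on_vdso_page:
        # Not a VDSO page: S17, then isolate.
        return steps + ["S17", "S21"]
    steps.append("S18")  # SRAO check
    if fault_type == "SRAO":
        # SRAO: no downtime at this point, go straight to isolation.
        return steps + ["S21"]
    # SRAR: active downtime, skip the memory dump, then isolate.
    return steps + ["S19", "S20", "S21"]


if __name__ == "__main__":
    print(" -> ".join(vdso_recovery_steps(True, "SRAR")))
```

Every path ends at S21, so the faulty page is isolated regardless of which branch is taken; only the SRAR path on a VDSO page passes through the downtime and dump-skipping steps.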
The method, the device, the electronic device and the medium for recovering the memory failure of the UE provided by the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A UE (uncorrectable error) memory fault recovery method, characterized by comprising:
when a UE memory fault is triggered, obtaining the UE memory fault address of the UE memory fault;
searching for the processes accessing a shared memory page when the page where the UE memory fault address is located is a shared memory page;
ending the processes accessing the shared memory page;
isolating the UE memory fault.
2. The method according to claim 1, wherein searching for the processes accessing the shared memory page comprises:
searching for the processes accessing the corresponding shared memory page when the page where the UE fault address is located is a VDSO shared memory page and the UE memory fault is of the SRAR type, or when the page where the UE fault address is located is a cmdline shared memory page;
searching for the processes accessing the VDSO shared memory page upon detecting that a process accesses the VDSO shared memory page, when the page where the UE fault address is located is a VDSO shared memory page and the UE memory fault is of the SRAO type.
3. The method according to claim 2, wherein, when the UE memory fault is of the SRAO type or the SRAR type, ending the processes accessing the shared memory page comprises:
ending the processes accessing the shared memory page through active downtime.
4. The method according to claim 3, wherein, after ending the processes accessing the shared memory page through active downtime and before isolating the UE memory fault, the method further comprises:
skipping the memory data dump when the operating system dmesg log contains information indicating that the memory data dump is to be skipped.
5. The UE memory fault recovery method according to claim 4, wherein, after skipping the memory data dump and before isolating the UE memory fault, the method further comprises:
recording the register information and the error address related to the UE memory fault.
6. The method according to claim 2, wherein, when the page where the UE fault address is located is a cmdline shared memory page, ending the processes accessing the shared memory page comprises:
adding a forced SRAR flag in the operating system;
ending, by means of the forced SRAR flag, all processes that have established a mapping to the shared memory page.
7. The method according to any one of claims 1 to 6, wherein, before obtaining the UE memory fault address of the UE memory fault, the method further comprises:
adding APEI bypass processing in the operating system.
8. A UE memory fault recovery apparatus, comprising:
an obtaining module, configured to obtain the UE memory fault address of a UE memory fault when the UE memory fault is triggered;
a searching module, configured to search for the processes accessing a shared memory page when the page where the UE memory fault address is located is a shared memory page;
an ending module, configured to end the processes accessing the shared memory page;
an isolation module, configured to isolate the UE memory fault.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for UE memory failure recovery according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for UE memory failure recovery according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211193186.8A CN115421960A (en) 2022-09-28 2022-09-28 UE memory fault recovery method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115421960A true CN115421960A (en) 2022-12-02

Family

ID=84207149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211193186.8A Pending CN115421960A (en) 2022-09-28 2022-09-28 UE memory fault recovery method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115421960A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126581A (en) * 2023-04-10 2023-05-16 阿里云计算有限公司 Memory fault processing method, device, system, equipment and storage medium
CN116126581B (en) * 2023-04-10 2023-09-01 阿里云计算有限公司 Memory fault processing method, device, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
US6622260B1 (en) System abstraction layer, processor abstraction layer, and operating system error handling
US8250543B2 (en) Software tracing
US9411689B2 (en) Method and relevant apparatus for starting boot program
US7900090B2 (en) Systems and methods for memory retention across resets
US20100223498A1 (en) Operating system-based application recovery
US9262283B2 (en) Method for reading kernel log upon kernel panic in operating system
US20060041739A1 (en) Memory dump generation with quick reboot
CN111383031B (en) Intelligent contract execution method and system in block chain and electronic equipment
US20120030766A1 (en) Method and system for defining a safe storage area for use in recovering a computer system
EP3274839B1 (en) Technologies for root cause identification of use-after-free memory corruption bugs
CN115421984A (en) Memory fault processing method and device, electronic equipment and medium
CN115421960A (en) UE memory fault recovery method, device, electronic equipment and medium
CN104866388B (en) Data processing method and device
CN110928720A (en) Core dump file generation method and device based on Linux system
CN114385418A (en) Protection method, device, equipment and storage medium for communication equipment
US9772892B2 (en) Recovery method for portable touch-control device and portable touch-control device using the same
CN113536320A (en) Error information processing method, device and storage medium
CN116126581B (en) Memory fault processing method, device, system, equipment and storage medium
US20110202903A1 (en) Apparatus and method for debugging a shared library
CN109634782B (en) Method and device for detecting system robustness, storage medium and terminal
US11630714B2 (en) Automated crash recovery
CN115168119A (en) PCIE (peripheral component interface express) link detection method, device and medium for server
US10768940B2 (en) Restoring a processing unit that has become hung during execution of an option ROM
US10592329B2 (en) Method and electronic device for continuing executing procedure being aborted from physical address where error occurs
JP6164283B2 (en) Software safe stop system, software safe stop method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination