CN116841789A - Memory fault processing method, device, equipment and medium - Google Patents

Publication number
CN116841789A
Authority
CN
China
Prior art keywords
memory
server
fault
memories
capacity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310796615.9A
Other languages
Chinese (zh)
Inventor
张国奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Shandong Computer Technology Co Ltd
Original Assignee
Inspur Shandong Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Shandong Computer Technology Co Ltd
Priority to CN202310796615.9A
Publication of CN116841789A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention discloses a memory fault processing method, device, equipment and medium, and relates to the technical field of servers. In the method, when a server detects a memory fault after power-on, the overall memory information of all memories in the server and the fault memory information of the faulty memory are acquired; whether the fault memory information meets a preset condition is judged according to the overall memory information; if not, the faulty memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when a memory fault is detected, the effect of the faulty memory on server startup is determined from the overall memory information and the fault memory information, with the preset condition as the reference. In particular, when the preset condition is not met, the faulty memory is skipped so that the server can restart normally, which solves the problem that a server performing memory isolation cannot start normally because of a memory fault.

Description

Memory fault processing method, device, equipment and medium
Technical Field
The present invention relates to the field of server technologies, and in particular, to a method, an apparatus, a device, and a medium for processing a memory failure.
Background
Current mainstream servers belong to the Intel x86 architecture family. As the times have advanced, server equipment is being replaced ever faster, and a new server architecture is urgently needed. The Advanced RISC Machine (ARM) architecture has emerged in response, and servers of the ARM architecture have developed rapidly in recent years.
However, the ecosystem of current ARM servers still lags behind that of mature Intel x86 servers. When an ARM server uses a memory isolation technique similar to that of an x86 server, a memory fault may occur, which can prevent the server from starting or cause it to go down.
In view of the above, how to solve the problem that a server cannot start normally because of a memory fault that occurs when the server performs memory isolation is an urgent problem for those skilled in the art.
Disclosure of Invention
The invention aims to provide a memory fault processing method, device, equipment and medium, which are used for solving the problem that a server cannot be started normally due to memory faults when the server executes memory isolation.
In order to solve the above technical problems, the present invention provides a memory failure processing method, including:
When a server detects that a memory fault exists after power-on, acquiring the overall memory information of all memories in the server; wherein the overall memory information comprises the total memory quantity and total capacity;
acquiring fault memory information of the fault memory in the server; wherein the fault memory information comprises the number of the fault memories and the capacity of the fault memories;
judging whether the fault memory information meets a preset condition according to the whole memory information;
if not, skipping the faulty memory and initializing the remaining memories according to the current memory initialization logic, so as to restart the server.
In one aspect, the determining whether the fault memory information meets a preset condition according to the overall memory information includes:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a first threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a second threshold value;
if not, the preset condition is not met, and the step of skipping the faulty memory is carried out.
In another aspect, after skipping the faulty memory and initializing the remaining memories according to the current memory initialization logic, the method further includes:
Acquiring the capacity of other memories according to the total capacity of all memories and the capacity of the fault memory; the other memories are memories except the fault memory in all the memories;
acquiring the system capacity of the server according to the quantity of other memories and the capacity of the other memories;
and distributing the system capacity of the server to a system of the server.
In another aspect, the method further comprises:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a third threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a fourth threshold value;
if yes, changing the current memory initialization logic;
wherein the third threshold is less than the first threshold and the fourth threshold is less than the second threshold.
In another aspect, after changing the current memory initialization logic, the method further includes:
acquiring a memory capacity allocation ratio of the server before restarting;
controlling the server to restart;
acquiring the memory capacity allocation ratio after restarting the server;
judging whether the memory capacity allocation ratio after restarting the server is not greater than the memory capacity allocation ratio before restarting the server;
if yes, performing virtual memory expansion on the capacity of each memory;
overclocking the central processing unit of the server;
allocating the capacity of each memory after the virtual memory expansion according to the memory capacity allocation ratio before restarting the server;
if not, deploying and regulating the system capacity of the server according to the memory capacity allocation ratio after restarting the server.
In another aspect, the method further comprises:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a fifth threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a sixth threshold value;
if not, confirming that the proportion of faulty memory exceeds the limit, and sending a prompt signal to a baseboard management controller of the server;
wherein the fifth threshold is greater than the first threshold and the sixth threshold is greater than the second threshold.
In another aspect, after restarting the server, the method further includes:
outputting information of successful restarting of the server;
and updating the running log of the server to record the restart reason of the server.
In order to solve the technical problem, the present invention further provides a memory failure processing device, including:
The first acquisition module is used for acquiring the whole memory information of all memories in the server when the memory fault is detected after the server is electrified; wherein the overall memory information comprises the total memory quantity and total capacity;
the second acquisition module is used for acquiring the fault memory information of the fault memory in the server; wherein the fault memory information comprises the number of the fault memories and the capacity of the fault memories;
the judging module is used for judging whether the fault memory information meets preset conditions according to the whole memory information; if not, triggering an initialization module;
the initialization module is used for skipping the faulty memory and initializing the remaining memories according to the current memory initialization logic, so as to restart the server.
In order to solve the above technical problem, the present invention further provides memory failure processing equipment, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the memory fault processing method when executing the computer program.
In order to solve the above technical problem, the present invention further provides a computer readable storage medium, where a computer program is stored, where the computer program implements the steps of the memory failure processing method when executed by a processor.
According to the memory fault processing method provided by the invention, when a memory fault is detected after the server is powered on, the overall memory information of all memories in the server is acquired; the overall memory information includes the number and the total capacity of all memories. The fault memory information of the faulty memory in the server is acquired; the fault memory information includes the number of faulty memories and the capacity of the faulty memories. Whether the fault memory information meets the preset condition is judged according to the overall memory information; if not, the faulty memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when the server detects a memory fault, the scheme determines how the faulty memory affects server startup from the overall memory information and the fault memory information, with the preset condition as the reference. Specifically, when the preset condition is not met, the faulty memory is skipped so that it does not affect the running of the server and the server can restart normally, which solves the problem that a server performing memory isolation cannot start normally because of a memory fault.
In addition, the invention also provides a memory fault processing device, equipment and medium, and the effects are the same as above.
Drawings
To describe the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a memory failure processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a memory failure processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a memory failure processing device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
The invention provides a memory fault processing method, a device, equipment and a medium, which aim to solve the problem that a server cannot be started normally due to memory faults when the server performs memory isolation.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description.
Currently, although mainstream servers are mainly of the Intel x86 architecture family, the development of Intel x86 servers has stagnated in recent years. As the times demand, server equipment is being replaced ever faster, and a new form of server architecture is urgently needed; servers of the ARM architecture have emerged as a result.
In recent years, servers of the ARM architecture have developed rapidly, but their ecosystem still lags behind that of mature Intel x86 servers. For example, when a server of the ARM architecture uses a memory isolation technique similar to that of an Intel x86 server (memory isolation is a security mechanism that makes a running system more secure and reliable by isolating the memory spaces of different processes or users), a memory training failure (i.e., a memory fault condition) may occur, which may prevent the server from starting. To solve this problem, the present invention provides a memory failure processing method, which aims to solve the problem that a server cannot start normally because of a memory fault when the server performs memory isolation. It can be understood that the method provided by the invention is mainly applied to servers, particularly servers that use a memory isolation technique, including but not limited to ARM servers that use a memory isolation technique.
Fig. 1 is a flowchart of a memory failure processing method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
S10: When the server detects a memory fault after power-on, acquire the overall memory information of all memories in the server.
The overall memory information includes the total memory quantity and total capacity.
Specifically, when a server of the ARM architecture that uses a memory isolation technique (for example, an FT2000+ server) is powered on and booted, the server checks whether it has a memory fault. When the server detects a memory fault after power-on, the overall memory information of all memories in the server is acquired.
It is understood that a server typically contains multiple memory modules, whose capacities may be the same or different. Acquiring the overall memory information of all memories in the server means treating all the memories as a whole and acquiring the memory information of that whole. The overall memory information includes the total number of memories and their total capacity. Note that the total capacity is the sum of the capacities of the individual memories in the server.
S11: Acquire the fault memory information of the faulty memory in the server.
The fault memory information comprises the number of the fault memories and the capacity of the fault memories.
It will be appreciated that when the server detects a memory failure condition, the failed memory can be located. Further, it is necessary to obtain failure memory information of the failure memory in the server. The fault memory information comprises the number of the fault memories and the capacity of the fault memories.
It should be noted that the number of the fault memories may be one or more, which is not limited in this embodiment, and depends on the specific implementation. Therefore, when the number of the fault memories is plural, in the obtained fault memory information, the capacity of the fault memory is specifically the sum of the capacities of the fault memories.
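The two kinds of information gathered in S10 and S11 can be illustrated with a minimal Python sketch. The `MemoryModule` type, the slot labels, and the GiB unit are assumptions made for illustration; the patent does not specify how firmware represents memory modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryModule:
    slot: str          # hypothetical slot label, not from the patent
    capacity_gib: int  # module capacity in GiB (unit is an assumption)
    failed: bool       # True if the module was found faulty at power-on


def overall_memory_info(modules):
    """S10: return (total number of memories, total capacity)."""
    return len(modules), sum(m.capacity_gib for m in modules)


def fault_memory_info(modules):
    """S11: return (number of faulty memories, summed faulty capacity)."""
    faulty = [m for m in modules if m.failed]
    return len(faulty), sum(m.capacity_gib for m in faulty)


modules = [
    MemoryModule("DIMM0", 16, False),
    MemoryModule("DIMM1", 32, True),
    MemoryModule("DIMM2", 16, False),
    MemoryModule("DIMM3", 32, True),
]
```

With several faulty modules, the faulty capacity is the sum over those modules, as the preceding paragraph describes.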
S12: Judge whether the fault memory information meets the preset condition according to the overall memory information; if not, proceed to step S13; if yes, proceed to step S14.
S13: Skip the faulty memory and initialize the remaining memories according to the current memory initialization logic, so as to restart the server.
S14: Restart the server.
After the overall memory information of all memories of the server and the fault memory information of the faulty memory have been acquired, whether the fault memory information meets the preset condition is judged according to the overall memory information. The preset condition in this embodiment serves as the criterion for determining how the faulty memory affects server startup. This embodiment does not limit the preset condition: a specific limit may be set on the number of memories or on the memory capacity, depending on the specific implementation.
When the fault memory information is confirmed, according to the overall memory information, not to meet the preset condition, the faulty memory is considered to have a large effect on the server. The faulty memory is then skipped and the other memories are initialized according to the current memory initialization logic, so that the server restarts and completes a normal startup. When the fault memory information is confirmed to meet the preset condition, the effect of the faulty memory is considered small, and the server can be restarted directly.
In this embodiment, when a memory fault is detected after the server is powered on, the overall memory information of all memories in the server is acquired; the overall memory information includes the number and the total capacity of all memories. The fault memory information of the faulty memory in the server is acquired; the fault memory information includes the number of faulty memories and the capacity of the faulty memories. Whether the fault memory information meets the preset condition is judged according to the overall memory information; if not, the faulty memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when the server detects a memory fault, the scheme determines how the faulty memory affects server startup from the overall memory information and the fault memory information, with the preset condition as the reference. Specifically, when the preset condition is not met, the faulty memory is skipped so that it does not affect the running of the server and the server can restart normally, which solves the problem that a server performing memory isolation cannot start normally because of a memory fault.
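The branch between S13 and S14 can be sketched as follows. This is only an illustration: the action names are hypothetical labels, and the preset condition is passed in as a function because this embodiment leaves it open.

```python
def plan_boot_actions(total_count, total_capacity,
                      failed_count, failed_capacity,
                      meets_preset_condition):
    """Return the ordered actions for S12 to S14.

    meets_preset_condition is a caller-supplied predicate; the action
    strings are hypothetical names, not firmware APIs.
    """
    if meets_preset_condition(total_count, total_capacity,
                              failed_count, failed_capacity):
        # S14: the faulty memory has little effect, restart directly.
        return ["restart_server"]
    # S13: skip the faulty memory, initialize the rest, then restart.
    return ["skip_faulty_memory",
            "initialize_remaining_memory",
            "restart_server"]


# Example predicate (an assumption): at most one faulty module is tolerated.
def at_most_one_faulty(total_count, total_capacity,
                       failed_count, failed_capacity):
    return failed_count <= 1
```

Passing the predicate as a parameter keeps the flow independent of whichever quantity or capacity limit a given implementation chooses.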
In order to determine whether the failure memory information of the failure memory meets the preset condition, based on the foregoing embodiment, as a preferred embodiment, determining whether the failure memory information meets the preset condition according to the overall memory information includes:
S120: Judge whether the ratio of the number of faulty memories to the number of all memories is not greater than a first threshold and the ratio of the capacity of the faulty memories to the capacity of all memories is not greater than a second threshold; if not, the preset condition is not met, and the process proceeds to step S13.
In a specific implementation, the number and total capacity of all memories are compared with the number and capacity of the faulty memories, respectively. Specifically, it is judged whether the ratio of the number of faulty memories to the number of all memories is not greater than the first threshold and the ratio of the capacity of the faulty memories to the capacity of all memories is not greater than the second threshold. This embodiment does not limit the first threshold or the second threshold; they depend on the specific implementation. Further, since the first threshold is a quantity threshold and the second threshold is a capacity threshold, there is no required size relationship between them.
When the ratio of the number of faulty memories to the number of all memories is greater than the first threshold, or the ratio of the capacity of the faulty memories to the capacity of all memories is greater than the second threshold, or both, the preset condition is not met. In that case the faulty memory has a large effect on the server, and step S13 must be entered to skip the faulty memory.
In this embodiment, based on the corresponding comparison of the number and the total capacity of all the memories and the number and the capacity of the failed memories, the judgment of whether the failed memory information meets the preset condition is implemented, and the accuracy of the judgment is improved.
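The check in S120 can be written as a small Python sketch. The threshold values in the examples are placeholders chosen for illustration; the patent does not fix them.

```python
def meets_preset_condition(failed_count, total_count,
                           failed_capacity, total_capacity,
                           first_threshold, second_threshold):
    """S120: True only if both ratios are within their thresholds."""
    count_ratio = failed_count / total_count
    capacity_ratio = failed_capacity / total_capacity
    return (count_ratio <= first_threshold
            and capacity_ratio <= second_threshold)


# Placeholder thresholds of 0.25 (assumed values, not from the patent).
ok = meets_preset_condition(1, 4, 16, 64, 0.25, 0.25)          # both ratios 0.25
too_many = meets_preset_condition(2, 4, 32, 64, 0.25, 0.25)    # both ratios 0.5
too_large = meets_preset_condition(1, 4, 32, 80, 0.25, 0.25)   # capacity ratio 0.4
```

Because the two conditions are joined by "and", exceeding either threshold alone is enough for the preset condition to fail and for step S13 to be entered.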
When the faulty memory is skipped, the remaining memories are initialized using the current memory initialization logic. However, although the faulty memory is skipped, its capacity is still counted in the overall system capacity. If the system capacity allocated under the original initialization logic cannot satisfy the same project deployment, the server's memory deployment runs into problems, so the system capacity needs to be reallocated. Specifically, based on the above embodiment, as a preferred embodiment, after skipping the faulty memory and initializing the remaining memories according to the current memory initialization logic, the method further includes:
S15: Acquire the capacity of the other memories according to the total capacity of all memories and the capacity of the faulty memory.
The other memories are the memories other than the faulty memory among all memories.
S16: Acquire the system capacity of the server according to the number of the other memories and the capacity of the other memories.
S17: Allocate the system capacity of the server to the system of the server.
In a specific implementation, the capacity of the other memories is acquired according to the total capacity of all memories and the capacity of the faulty memory. The other memories are the memories other than the faulty memory among all memories, and their capacity is the capacity of each remaining usable memory. From the number and the capacities of the other memories, the total remaining usable memory capacity is obtained; this total is then divided so that the server satisfies the system capacity required under the same project deployment, and finally the system capacity is allocated to the system of the server, so as to meet the requirement of uniform system deployment as far as possible.
In this embodiment, the system capacity of the server is further obtained by obtaining the capacities of other memories according to the total capacity of all memories and the capacities of the failed memories, and the system capacity of the server is distributed to the system of the server, so as to meet the requirement of uniform deployment of the server system.
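A minimal sketch of S15 and S16 follows, assuming per-slot capacities are available as a mapping; the slot names and the GiB unit are illustrative. The actual division of the capacity among system uses in S17 is implementation-specific and is not modeled here.

```python
def usable_system_capacity(capacities_gib, faulty_slots):
    """Return (number of other memories, their total capacity).

    capacities_gib maps a slot label to its capacity; faulty_slots is
    the set of slots that were skipped. Both layouts are assumptions.
    """
    total = sum(capacities_gib.values())
    faulty = sum(capacities_gib[s] for s in faulty_slots)
    other_capacity = total - faulty                       # S15
    other = {s: c for s, c in capacities_gib.items()
             if s not in faulty_slots}
    # S16: the per-module sum must agree with the subtraction in S15.
    if sum(other.values()) != other_capacity:
        raise ValueError("inconsistent capacity accounting")
    return len(other), other_capacity


capacities = {"DIMM0": 16, "DIMM1": 32, "DIMM2": 16, "DIMM3": 32}
result = usable_system_capacity(capacities, {"DIMM1"})
```

Reporting only the capacity of the remaining modules avoids counting the skipped module toward the system capacity.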
In order to achieve higher performance of the server memory, the method further includes, as a preferred embodiment, on the basis of the foregoing embodiment:
S18: Judge whether the ratio of the number of faulty memories to the number of all memories is not greater than a third threshold and the ratio of the capacity of the faulty memories to the capacity of all memories is not greater than a fourth threshold; if yes, proceed to step S19.
S19: Change the current memory initialization logic.
Wherein the third threshold is less than the first threshold and the fourth threshold is less than the second threshold.
In a specific implementation, it is determined whether a ratio of the number of failed memories to the number of all memories is not greater than a third threshold, and a ratio of a capacity of the failed memories to a capacity of all memories is not greater than a fourth threshold. In this embodiment, the third threshold and the fourth threshold are not limited, and they are determined according to specific implementation cases. Since the third threshold is a number threshold and the fourth threshold is a capacity threshold, there is no size relationship between the third threshold and the fourth threshold. It should be noted that the third threshold is smaller than the first threshold and the fourth threshold is smaller than the second threshold.
When the ratio of the number of faulty memories to the number of all memories is not greater than the third threshold and the ratio of the capacity of the faulty memories to the capacity of all memories is not greater than the fourth threshold, the faulty memory is considered to have only a small effect on the server, and the current memory initialization logic can be changed so that the server memory achieves higher performance.
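The stricter check in S18 can be sketched as below. The function also validates the stated ordering (third threshold below the first, fourth below the second); raising `ValueError` for a bad configuration is a design choice of this sketch, not something the patent prescribes.

```python
def should_change_init_logic(count_ratio, capacity_ratio,
                             first, second, third, fourth):
    """S18: True if both ratios are within the third/fourth thresholds."""
    if not (third < first and fourth < second):
        raise ValueError("third/fourth thresholds must be below first/second")
    return count_ratio <= third and capacity_ratio <= fourth


# Placeholder thresholds (assumed): first = second = 0.25, third = fourth = 0.125.
minor_fault = should_change_init_logic(0.125, 0.125, 0.25, 0.25, 0.125, 0.125)
larger_fault = should_change_init_logic(0.25, 0.25, 0.25, 0.25, 0.125, 0.125)
```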
In practice, changing the memory initialization logic may occasionally cause problems. For example, after the server restarts, the memory capacity available to the server system may differ from the capacity available at the previous startup, so that a server performance test or other tests cannot meet their memory capacity requirements. To avoid this problem, based on the above embodiment, as a preferred embodiment, after changing the current memory initialization logic, the method further includes:
S20: Acquire the memory capacity allocation ratio of the server before restarting.
S21: Control the server to restart.
S22: Acquire the memory capacity allocation ratio after restarting the server.
S23: Judge whether the memory capacity allocation ratio after restarting the server is not greater than the memory capacity allocation ratio before restarting the server; if yes, proceed to step S24; if not, proceed to step S27.
S24: Perform virtual memory expansion on the capacity of each memory.
S25: Overclock the central processing unit of the server.
S26: Allocate the capacity of each memory after the virtual memory expansion according to the memory capacity allocation ratio before restarting the server.
S27: Deploy and regulate the system capacity of the server according to the memory capacity allocation ratio after restarting the server.
First, the memory capacity allocation ratio of the server before restarting is acquired through the Basic Input Output System (BIOS) firmware of the server. The server is then controlled to restart, and the memory capacity allocation ratio after restarting is recorded. Taking the memory capacity allocation after restarting as the base point, it is judged whether the memory capacity allocation ratio after restarting the server is not greater than the memory capacity allocation ratio before restarting the server.
If the memory capacity allocation ratio after restarting the server is not greater than the ratio before restarting, virtual memory expansion is performed on the capacity of each memory to make up for the missing memory capacity, and the central processing unit of the server is overclocked to accommodate the virtual memory expansion. Finally, the capacity of each memory after the virtual memory expansion is allocated according to the memory capacity allocation ratio before restarting the server, so as to maintain a reasonable memory allocation ratio.
If the memory capacity allocation ratio after restarting the server is greater than the ratio before restarting, the memory capacity allocation ratio after restarting is recorded, and the system capacity of the server is deployed and regulated as a whole according to that ratio, so as to keep the memory capacity allocation ratios before and after restarting consistent.
In this embodiment, the memory capacity allocation ratios before and after restarting the server are recorded, and different memory allocation modes are adopted depending on which of the two is larger, so as to meet the memory capacity requirements of server performance tests or other tests.
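The branch in S23 can be sketched as follows. The patent does not define how the memory capacity allocation ratio is computed, so this sketch simply treats it as a number; the action names are hypothetical labels for steps S24 to S27.

```python
def post_restart_actions(ratio_before, ratio_after):
    """S23: choose the follow-up steps from the two allocation ratios."""
    if ratio_after <= ratio_before:
        return [
            "virtual_memory_expansion",           # S24
            "overclock_cpu",                      # S25
            "allocate_by_pre_restart_ratio",      # S26
        ]
    return ["regulate_by_post_restart_ratio"]     # S27


shrunk = post_restart_actions(ratio_before=0.8, ratio_after=0.6)
grown = post_restart_actions(ratio_before=0.6, ratio_after=0.8)
```

Note that equal ratios fall into the first branch, because the condition in S23 is "not greater than".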
In order to prompt the server and the user when the operation of the server is seriously affected by the failed memory, the method further comprises, based on the above embodiment, as a preferred embodiment:
S28: Judge whether the ratio of the number of faulty memories to the number of all memories is not greater than a fifth threshold and the ratio of the capacity of the faulty memories to the capacity of all memories is not greater than a sixth threshold; if not, proceed to step S29.
S29: Confirm that the proportion of faulty memory exceeds the limit, and send a prompt signal to the baseboard management controller of the server.
Wherein the fifth threshold is greater than the first threshold and the sixth threshold is greater than the second threshold.
In a specific implementation, it is determined whether the ratio of the number of failed memories to the number of all memories is not greater than a fifth threshold and the ratio of the capacity of the failed memories to the capacity of all memories is not greater than a sixth threshold. This embodiment does not limit the values of the fifth and sixth thresholds; they are set according to the specific implementation. Because the fifth threshold applies to a quantity ratio and the sixth threshold applies to a capacity ratio, no magnitude relationship is required between them. It should be noted, however, that the fifth threshold is greater than the first threshold and the sixth threshold is greater than the second threshold.
If either ratio exceeds its threshold, that is, the ratio of the number of failed memories to the number of all memories is greater than the fifth threshold, the ratio of the capacity of the failed memories to the capacity of all memories is greater than the sixth threshold, or both, the proportion of failed memory is considered to exceed the limit: the operation of the server is seriously affected, and the remaining memory can no longer support the operating environment of the server. In this case, a prompt signal is sent to the baseboard management controller (Baseboard Management Controller, BMC) of the server, so that the system and the user learn of the memory fault condition. The user is reminded that the failed memory must be replaced, which prevents the failed memory from causing a greater impact.
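A minimal sketch of the over-limit check in steps S28 and S29 is given below. The threshold values and the `notify_bmc` callback are assumptions made for illustration; the text does not specify how the prompt signal is delivered to the BMC.

```python
from typing import Callable


def failed_share_over_limit(
    failed_count: int,
    total_count: int,
    failed_capacity_gb: float,
    total_capacity_gb: float,
    fifth_threshold: float,
    sixth_threshold: float,
) -> bool:
    """Return True when either failed-memory ratio exceeds its threshold."""
    if total_count <= 0 or total_capacity_gb <= 0:
        raise ValueError("total count and total capacity must be positive")
    count_ratio = failed_count / total_count
    capacity_ratio = failed_capacity_gb / total_capacity_gb
    return count_ratio > fifth_threshold or capacity_ratio > sixth_threshold


def check_and_notify(over_limit: bool, notify_bmc: Callable[[str], None]) -> None:
    """Send the prompt signal through a hypothetical callback (S29)."""
    if over_limit:
        notify_bmc("failed memory proportion exceeds limit; replace failed memory")
```

For example, with 9 of 16 modules failed and both thresholds set to 0.5, the quantity ratio alone (0.5625) is enough to trigger the prompt.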
In addition, in order to enable the user to acquire the server restart and the related information thereof in time, on the basis of the above embodiment, as a preferred embodiment, after restarting the server, the method further includes:
S30: outputting information that the server has restarted successfully.
S31: the running log of the server is updated to record the cause of the server restart.
Specifically, after restarting the server, outputting information that the server is restarted successfully. In this embodiment, the specific form of the output information is not limited, and may be audio prompt information or light prompt information, according to specific implementation conditions. The running log of the server is further updated to record the restarting reason of the server, so that a user can find the restarting reason of the server in the log to accurately judge the running condition of the server.
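Steps S30 and S31 could look roughly like the sketch below, which uses Python's standard `logging` module. The in-memory log, the message wording, and the function names are assumptions; the text leaves the output form open (for example, an audio or light prompt), so `print` is only a stand-in.

```python
import io
import logging
from typing import Tuple


def build_buffer_logger(name: str) -> Tuple[logging.Logger, io.StringIO]:
    """Hypothetical stand-in for the server's running log: writes to memory."""
    buffer = io.StringIO()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(buffer))
    return logger, buffer


def report_restart(reason: str, run_log: logging.Logger) -> str:
    """Output the success information (S30) and record the restart reason (S31)."""
    message = "server restart succeeded"
    print(message)  # stand-in for an audio or light prompt
    run_log.info("server restarted; reason: %s", reason)
    return message
```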
In the above embodiments, the detailed description is given to the memory failure processing method, and the present invention further provides a corresponding embodiment of the memory failure processing apparatus.
Fig. 2 is a schematic diagram of a memory failure processing apparatus according to an embodiment of the present invention. As shown in fig. 2, the memory failure processing apparatus includes:
The first obtaining module 10 is configured to obtain overall memory information of all memories in the server when a memory fault is detected after the server is powered on; the overall memory information includes the total number and total capacity of the memories.
The second obtaining module 11 is configured to obtain fault memory information of the failed memory in the server; the fault memory information includes the number and capacity of the failed memories.
The judging module 12 is configured to judge, according to the overall memory information, whether the fault memory information meets a preset condition; if not, the initialization module is triggered.
The initialization module 13 is configured to skip the failed memory and initialize the remaining memories according to the current memory initialization logic, so as to restart the server.
As a preferred embodiment, the judging module includes:
the first threshold judging module is used for judging whether the ratio of the number of the fault memories to the number of all memories is not larger than a first threshold, and the ratio of the capacity of the fault memories to the capacity of all memories is not larger than a second threshold; if not, the preset condition is not met, and the initialization module is triggered.
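The interaction between the judging module and the initialization module can be sketched as follows. The `Dimm` record, the threshold values, and the choice to return `None` when the preset condition is met are assumptions; this passage only describes the case in which the condition is not met.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dimm:
    """Hypothetical record for one installed memory module."""
    slot: str
    capacity_gb: int
    failed: bool


def meets_preset_condition(dimms: List[Dimm], first_threshold: float,
                           second_threshold: float) -> bool:
    """Both failed-memory ratios must be not greater than their thresholds."""
    failed = [d for d in dimms if d.failed]
    count_ratio = len(failed) / len(dimms)
    capacity_ratio = (sum(d.capacity_gb for d in failed)
                      / sum(d.capacity_gb for d in dimms))
    return count_ratio <= first_threshold and capacity_ratio <= second_threshold


def dimms_to_initialize(dimms: List[Dimm], first_threshold: float,
                        second_threshold: float) -> Optional[List[Dimm]]:
    """Return the memories to initialize, skipping failed ones, or None."""
    if meets_preset_condition(dimms, first_threshold, second_threshold):
        return None  # initialization module is not triggered
    return [d for d in dimms if not d.failed]
```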
As a preferred embodiment, further comprising:
the third obtaining module is used for obtaining the capacity of other memories according to the total capacity of all memories and the capacity of the fault memory after the rest memories are initialized according to the current memory initialization logic after skipping operation of the fault memory; wherein, the other memories are memories except the fault memory in all memories;
The fourth acquisition module is used for acquiring the system capacity of the server according to the number of other memories and the capacity of the other memories;
and the first allocation module is used for allocating the system capacity of the server to the system of the server.
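The capacity arithmetic behind these modules can be illustrated with a short sketch. How the system capacity is derived from the number and capacity of the other memories is not spelled out here, so the rule in `system_capacity_gb` (all remaining capacity goes to the system when at least one module remains) is purely an assumption, as are the function names.

```python
from typing import Tuple


def remaining_memory(total_count: int, total_capacity_gb: int,
                     failed_count: int, failed_capacity_gb: int) -> Tuple[int, int]:
    """Number and capacity of the other (non-failed) memories."""
    if failed_count > total_count or failed_capacity_gb > total_capacity_gb:
        raise ValueError("failed memory cannot exceed installed memory")
    return total_count - failed_count, total_capacity_gb - failed_capacity_gb


def system_capacity_gb(other_count: int, other_capacity_gb: int) -> int:
    """Assumed rule: with no usable module there is no capacity to allocate."""
    return other_capacity_gb if other_count > 0 else 0
```

For instance, a server with 16 modules totaling 512 GB and 2 failed modules totaling 64 GB is left with 14 modules and 448 GB.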
As a preferred embodiment, further comprising:
the second threshold judging module is used for judging whether the ratio of the number of the fault memories to the number of all memories is not larger than a third threshold, and the ratio of the capacity of the fault memories to the capacity of all memories is not larger than a fourth threshold; if yes, triggering a change module;
the modification module is used for modifying the current memory initialization logic;
wherein the third threshold is less than the first threshold and the fourth threshold is less than the second threshold.
As a preferred embodiment, further comprising:
a fifth obtaining module, configured to obtain a memory capacity allocation ratio of the server before restarting after changing the current memory initialization logic;
the restarting module is used for controlling the restarting of the server;
a sixth obtaining module, configured to obtain a memory capacity allocation ratio after restarting the server;
the memory capacity allocation ratio judging module is used for judging whether the memory capacity allocation ratio after restarting the server is not more than the memory capacity allocation ratio before restarting the server; if yes, triggering the capacity expansion module; if not, triggering the deployment module;
The capacity expansion module is used for performing memory virtual expansion on the capacity of each memory;
the over-frequency module is used for performing over-frequency processing on the central processing unit of the server;
the second allocation module is used for allocating the capacity of each memory after virtual expansion of the memory according to the memory capacity allocation ratio before restarting of the server;
the deployment module is used for carrying out capacity deployment and regulation on the system capacity of the server according to the memory capacity distribution ratio after restarting the server.
As a preferred embodiment, further comprising:
the third threshold judging module is used for judging whether the ratio of the number of the fault memories to the number of all memories is not greater than a fifth threshold, and the ratio of the capacity of the fault memories to the capacity of all memories is not greater than a sixth threshold; if not, triggering a confirmation sending module;
the confirming and sending module is used for confirming that the fault memory duty ratio exceeds the limit and sending a prompt signal to a baseboard management controller of the server;
wherein the fifth threshold is greater than the first threshold and the sixth threshold is greater than the second threshold.
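The ordering constraints stated for the six thresholds (third below first below fifth for the quantity ratio, fourth below second below sixth for the capacity ratio) can be checked with a small sketch. The `Thresholds` class and the example values are hypothetical; the text does not fix any concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the six thresholds named in the text."""
    first: float   # quantity ratio, preset condition
    second: float  # capacity ratio, preset condition
    third: float   # quantity ratio, change of initialization logic
    fourth: float  # capacity ratio, change of initialization logic
    fifth: float   # quantity ratio, over-limit prompt
    sixth: float   # capacity ratio, over-limit prompt

    def is_ordered(self) -> bool:
        """Check third < first < fifth and fourth < second < sixth."""
        return (self.third < self.first < self.fifth
                and self.fourth < self.second < self.sixth)
```

No relationship is checked between quantity thresholds and capacity thresholds, since the text states that none is required.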
As a preferred embodiment, further comprising:
the output module is used for outputting information of successful restarting of the server after restarting the server;
And the log updating module is used for updating the running log of the server so as to record the restarting reason of the server.
In this embodiment, the memory failure processing apparatus includes a first obtaining module, a second obtaining module, a judging module, and an initialization module, and when running it can implement all the steps of the memory failure processing method described above. When a memory fault is detected after the server is powered on, the overall memory information of all memories in the server is obtained; the overall memory information includes the number and total capacity of all memories. Fault memory information of the failed memory in the server is obtained; the fault memory information includes the number and capacity of the failed memories. Whether the fault memory information meets a preset condition is judged according to the overall memory information; if not, the failed memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when the server detects a memory fault, the scheme uses the overall memory information of all memories and the fault memory information of the failed memory, with the preset condition as a reference, to determine how the failed memory affects server startup. Specifically, when the preset condition is not met, the failed memory is skipped so that it does not affect the operation of the server and the server can restart normally, which solves the problem that the server cannot start normally because of a memory fault when the server performs memory isolation.
Fig. 3 is a schematic diagram of a memory failure processing device according to an embodiment of the present invention. As shown in fig. 3, the memory failure processing apparatus includes:
a memory 20 for storing a computer program.
The processor 21 is configured to implement the steps of the memory failure processing method as mentioned in the above embodiment when executing the computer program.
The memory fault handling device provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.
Processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form among a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 21 may also comprise a main processor and a coprocessor: the main processor, also called the central processing unit (Central Processing Unit, CPU), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor for handling computing operations related to machine learning.
Memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201 which, when loaded and executed by the processor 21, can implement the relevant steps of the memory failure processing method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and others. Data 203 may include, but is not limited to, data related to the memory failure processing method.
In some embodiments, the memory fault handling device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the architecture shown in fig. 3 does not limit the memory failure processing device, which may include more or fewer components than shown.
In this embodiment, the memory failure processing device includes a memory and a processor. The memory is used for storing a computer program, and the processor is configured to implement the steps of the memory failure processing method mentioned in the above embodiments when executing the computer program. When a memory fault is detected after the server is powered on, the overall memory information of all memories in the server is obtained; the overall memory information includes the number and total capacity of all memories. Fault memory information of the failed memory in the server is obtained; the fault memory information includes the number and capacity of the failed memories. Whether the fault memory information meets a preset condition is judged according to the overall memory information; if not, the failed memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when the server detects a memory fault, the scheme uses the overall memory information of all memories and the fault memory information of the failed memory, with the preset condition as a reference, to determine how the failed memory affects server startup. Specifically, when the preset condition is not met, the failed memory is skipped so that it does not affect the operation of the server and the server can restart normally, which solves the problem that the server cannot start normally because of a memory fault when the server performs memory isolation.
Finally, the invention also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps as described in the method embodiments above.
It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product that is stored in a storage medium and used to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In this embodiment, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps described in the above method embodiments are implemented. When a memory fault is detected after the server is powered on, the overall memory information of all memories in the server is obtained; the overall memory information includes the number and total capacity of all memories. Fault memory information of the failed memory in the server is obtained; the fault memory information includes the number and capacity of the failed memories. Whether the fault memory information meets a preset condition is judged according to the overall memory information; if not, the failed memory is skipped and the remaining memories are initialized according to the current memory initialization logic, so as to restart the server. Thus, when the server detects a memory fault, the scheme uses the overall memory information of all memories and the fault memory information of the failed memory, with the preset condition as a reference, to determine how the failed memory affects server startup. Specifically, when the preset condition is not met, the failed memory is skipped so that it does not affect the operation of the server and the server can restart normally, which solves the problem that the server cannot start normally because of a memory fault when the server performs memory isolation.
The memory fault processing method, device, equipment and medium provided by the invention have been described in detail above. The embodiments in this description are presented in a progressive manner: each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make various modifications and adaptations of the invention without departing from its principles, and these modifications and adaptations are intended to fall within the scope of the invention as defined in the following claims.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. The memory fault processing method is characterized by comprising the following steps of:
when a server detects that a memory fault exists after power-on, acquiring the overall memory information of all memories in the server; wherein the overall memory information comprises the total memory quantity and total capacity;
acquiring fault memory information of the fault memory in the server; wherein the fault memory information comprises the number of the fault memories and the capacity of the fault memories;
judging whether the fault memory information meets a preset condition according to the whole memory information;
if not, skipping operation of the fault memory, and initializing the rest memories according to the current memory initialization logic so as to restart the server.
2. The memory failure processing method according to claim 1, wherein the determining whether the failed memory information satisfies a preset condition according to the overall memory information includes:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a first threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a second threshold value;
if not, the preset condition is not met, and the method proceeds to the step of skipping operation of the fault memory.
3. The memory failure handling method of claim 1, further comprising, after said skipping operation of said failed memory and initializing the remaining memory according to current memory initialization logic:
acquiring the capacity of other memories according to the total capacity of all memories and the capacity of the fault memory; the other memories are memories except the fault memory in all the memories;
acquiring the system capacity of the server according to the quantity of other memories and the capacity of the other memories;
and distributing the system capacity of the server to a system of the server.
4. The memory fault handling method of claim 2, further comprising:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a third threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a fourth threshold value;
if yes, changing the current memory initialization logic;
wherein the third threshold is less than the first threshold and the fourth threshold is less than the second threshold.
5. The memory fault handling method of claim 4, further comprising, after said changing the current memory initialization logic:
acquiring a memory capacity allocation ratio of the server before restarting;
controlling the server to restart;
acquiring the memory capacity distribution ratio after restarting the server;
judging whether the memory capacity distribution ratio after restarting the server is not more than the memory capacity distribution ratio before restarting the server;
if yes, performing memory virtual expansion on the capacity of each memory;
performing over-frequency processing on a central processing unit of the server;
distributing the capacity of each memory after virtual expansion of the memory according to the memory capacity distribution ratio before restarting the server;
if not, performing capacity deployment and regulation on the system capacity of the server according to the memory capacity distribution ratio after restarting the server.
6. The memory fault handling method according to claim 4 or 5, further comprising:
judging whether the ratio of the number of the fault memories to the number of all the memories is not greater than a fifth threshold value, and the ratio of the capacity of the fault memories to the capacity of all the memories is not greater than a sixth threshold value;
if not, confirming that the fault memory ratio exceeds the limit, and sending a prompt signal to a baseboard management controller of the server;
wherein the fifth threshold is greater than the first threshold and the sixth threshold is greater than the second threshold.
7. The memory failure processing method according to claim 6, further comprising, after restarting the server:
outputting information of successful restarting of the server;
and updating the running log of the server to record the restart reason of the server.
8. A memory failure handling apparatus, comprising:
the first acquisition module is used for acquiring the whole memory information of all memories in the server when the memory fault is detected after the server is electrified; wherein the overall memory information comprises the total memory quantity and total capacity;
the second acquisition module is used for acquiring the fault memory information of the fault memory in the server; wherein the fault memory information comprises the number of the fault memories and the capacity of the fault memories;
the judging module is used for judging whether the fault memory information meets preset conditions according to the whole memory information; if not, triggering an initialization module;
the initialization module is used for skipping operation of the fault memory, initializing the rest memories according to the current memory initialization logic, and restarting the server.
9. A memory failure processing apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the memory failure handling method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the memory failure handling method according to any of claims 1 to 7.
CN202310796615.9A 2023-06-30 2023-06-30 Memory fault processing method, device, equipment and medium Pending CN116841789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310796615.9A CN116841789A (en) 2023-06-30 2023-06-30 Memory fault processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310796615.9A CN116841789A (en) 2023-06-30 2023-06-30 Memory fault processing method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116841789A true CN116841789A (en) 2023-10-03

Family

ID=88173699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310796615.9A Pending CN116841789A (en) 2023-06-30 2023-06-30 Memory fault processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116841789A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination