CN114090316A - Memory fault processing method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN114090316A
Authority
CN
China
Prior art keywords
memory
preset
attribute information
correctable errors
preset monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111348504.9A
Other languages
Chinese (zh)
Inventor
葛士建
张宇
段熊春
聂海涛
许晓菡
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202111348504.9A
Publication of CN114090316A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The disclosure relates to a memory fault processing method and device, a storage medium, and an electronic device. The method comprises: determining the number of memory correctable errors occurring within a preset monitoring duration; determining the memory page corresponding to the memory correctable errors when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number; and performing dynamic offline processing on the memory page. Memory pages on which a large number of memory correctable errors occur continuously are thereby kept from being used by application-layer software, which avoids the more serious memory uncorrectable errors that such pages could cause.

Description

Memory fault processing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of memory technologies, and in particular, to a memory fault processing method and apparatus, a storage medium, and an electronic device.
Background
A memory correctable error is a memory error that the Central Processing Unit (CPU) can correct automatically, whereas a memory uncorrectable error is a memory error that the CPU cannot correct. A large number of memory correctable errors may degrade system performance or may develop into memory uncorrectable errors, which can cause a server to crash and restart and thereby affect services. How to avoid memory uncorrectable errors is therefore crucial.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a memory fault handling method, including:
determining the number of memory correctable errors occurring within a preset monitoring duration;
determining a memory page corresponding to the memory correctable errors when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and performing dynamic offline processing on the memory page.
In a second aspect, the present disclosure provides a memory failure processing apparatus, including:
a first determining module, configured to determine the number of memory correctable errors occurring within a preset monitoring duration;
a second determining module, configured to determine the memory page on which the memory correctable errors occur when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and a processing module, configured to perform dynamic offline processing on the memory page.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method in the first aspect.
With this technical solution, when the number of memory correctable errors occurring within the preset monitoring duration reaches the preset number, the memory page corresponding to those errors is determined and dynamic offline processing is performed on it. Memory pages on which a large number of memory correctable errors occur continuously are then no longer used by programs, which avoids the memory uncorrectable errors such pages could cause, avoids system downtime, and keeps the memory space used by programs healthy. In addition, setting a preset monitoring duration prevents a memory page from being dynamically taken offline merely because the preset number of memory correctable errors has accumulated over a long period. This reduces the impact of dynamic offline processing on system performance and on available memory capacity, so both are well maintained.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart illustrating a memory failure handling method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a block diagram illustrating a memory failure handling apparatus according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
As mentioned in the background, the CPU cannot correct a memory uncorrectable error by itself, so how to avoid memory uncorrectable errors is important.
In order to facilitate understanding of the present disclosure, key technologies and terms involved in the embodiments of the present disclosure are explained below.
Memory correctable errors and memory uncorrectable errors are both memory errors. A memory correctable error is an error that occurs in one memory particle; if an error occurs in another memory particle at the same time, the system generates a memory uncorrectable error.
Dynamic offline processing refers to a memory page isolation technique, in which the Operating System (OS) layer isolates a memory page from use. Once a memory page has been isolated, it can no longer be used by a program; the program may be, for example, a program corresponding to application software.
The following describes the detailed technical solution of the present disclosure in detail with reference to specific examples.
Fig. 1 is a flowchart illustrating a memory failure handling method according to an exemplary embodiment of the present disclosure, where the memory failure handling method may be applied to an electronic device such as a server, and referring to fig. 1, the memory failure handling method may include the following steps:
step S101, determining the number of correctable errors of the memory within a preset monitoring time.
In some embodiments, the number of memory correctable errors occurring within the preset monitoring duration may be determined in real time once the server has started up.
In some embodiments, the preset monitoring duration may be 1/30 seconds.
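As an illustration of step S101, the following is a minimal sketch of counting memory correctable errors within a monitoring duration. It uses a sliding window, which is one possible reading of the step and not necessarily the implementation intended by the disclosure. The class name, the method names, and the default threshold of 15 are hypothetical choices made for this example; timestamps are in seconds.

```python
from collections import deque


class CorrectableErrorWindow:
    """Track correctable errors (CEs) seen within the last `window_s` seconds.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """

    def __init__(self, window_s=1 / 30, threshold=15):
        self.window_s = window_s      # preset monitoring duration, in seconds
        self.threshold = threshold    # preset number of correctable errors
        self._events = deque()        # (timestamp, physical_address) pairs

    def record(self, timestamp, address):
        """Record one CE; return True once the preset number is reached."""
        self._events.append((timestamp, address))
        # Drop events that have fallen out of the monitoring window.
        while timestamp - self._events[0][0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold

    def addresses(self):
        """Physical addresses of the CEs currently inside the window."""
        return [addr for _, addr in self._events]
```

When `record` returns True, the addresses returned by `addresses()` identify the errors whose memory pages are candidates for dynamic offline processing.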
In some embodiments, the preset monitoring duration may be dynamically configured according to historical monitoring conditions. By way of example, the process of dynamically configuring may include: acquiring attribute information; processing the attribute information according to the pre-trained time prediction model to obtain the predicted monitoring duration; and updating the preset monitoring duration according to the predicted monitoring duration.
Wherein the attribute information includes environment attribute information and device attribute information. For example, the environment attribute information may include environment temperature and humidity information, the device attribute information may include memory particle attribute information, and the memory particle attribute information may be, for example, batch information of memory particles.
It should be noted that the time prediction model represents a mapping between attribute information and predicted monitoring duration. Each item of attribute information corresponds to exactly one predicted monitoring duration, so processing the attribute information with the time prediction model yields the predicted monitoring duration corresponding to that attribute information.
It should be noted that if a preset number of memory correctable errors occur within the preset monitoring duration, a memory uncorrectable error is likely to occur, and therefore, the number of memory correctable errors occurring within the duration is counted, so that the occurrence of a memory uncorrectable error can be avoided.
In some embodiments, the time prediction model may be trained from historical experimental data. For example, the training samples of the time prediction model include sample attribute information and sample time labels. A neural network model is trained on the sample attribute information and the sample time labels to obtain the time prediction model, which represents a function fitted to the relationship between the sample attribute information and the sample time labels.
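The idea of fitting a function from attribute information to a monitoring duration can be sketched as follows. The disclosure describes a neural network; this sketch substitutes a single linear unit trained by gradient descent on mean squared error, purely to keep the example short and dependency-free. The function names, the learning rate, and the assumption that features are already scaled to roughly [0, 1] are all hypothetical.

```python
def train_linear_model(samples, labels, lr=0.1, epochs=2000):
    """Fit duration ~ w . x + b by batch gradient descent on mean squared error.

    samples: list of feature tuples (assumed scaled to about [0, 1]).
    labels:  list of monitoring durations in seconds (the sample time labels).
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j, xi in enumerate(x):
                grad_w[j] += 2 * err * xi / n
            grad_b += 2 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(model, x):
    """Return the predicted monitoring duration for one feature tuple."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias
```

A caller would train once on historical samples and then call `predict` on the current attribute information; the result is the predicted monitoring duration. A real deployment would likely also clamp the prediction to a sensible positive range, which this sketch omits.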
In some embodiments, the preset monitoring duration may be replaced with a predicted monitoring duration predicted by the time prediction model, so as to update the preset monitoring duration.
In some embodiments, the attribute information may be acquired when the number of the memory correctable errors occurring within the preset monitoring duration reaches a preset number, the attribute information may be processed according to the time prediction model, and the preset monitoring duration may be updated according to the obtained predicted monitoring duration.
In some embodiments, the device attribute information and the environment temperature and humidity information are relatively stable over a given period, so the preset monitoring duration is unlikely to differ from the predicted monitoring duration during that period. The step of obtaining the attribute information may therefore be performed only at intervals of a certain time period, after which the attribute information is processed according to the time prediction model and the preset monitoring duration is updated according to the obtained predicted monitoring duration. This avoids needless use of system resources. It can be understood that the step of obtaining the attribute information is thus performed at a point when the current attribute information may have changed significantly since the step was last performed.
In some embodiments, the attribute information may be acquired when the number of memory correctable errors occurring within a target duration reaches the preset number; the attribute information is then processed according to the time prediction model, and the preset monitoring duration is updated according to the obtained predicted monitoring duration. The target duration is a duration shorter than the preset monitoring duration, measured from the start time to the moment the count reaches the preset number. The following describes, by way of example, updating the preset monitoring duration when the number of memory correctable errors occurring within the target duration reaches the preset number.
For example, a timer and a counter are initialized at a start time, where the timer measures the time elapsed after the start time and the counter counts the memory correctable errors occurring within the preset monitoring duration. When the number of memory correctable errors occurring within the preset monitoring duration reaches the preset number, the target duration is determined from the timer's count and the start time. The step of acquiring the attribute information is executed when the difference between the preset monitoring duration and the target duration is greater than a preset difference.
The preset monitoring duration is the length of time from the start time to the end time.
It should be noted that the preset difference may be set according to actual situations, and the embodiment is not limited herein.
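The timer-and-counter example can be sketched as follows. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, timestamps are plain floats in seconds, and restarting the window at the first error after it expires is a design choice made here, not something the disclosure specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowResult:
    threshold_reached: bool                  # preset number reached in window
    target_duration: Optional[float] = None  # time taken to reach it
    refresh_duration: bool = False           # whether to re-run the model


class FixedWindowMonitor:
    def __init__(self, start_time, preset_duration, preset_count, preset_difference):
        self.start_time = start_time          # timer origin
        self.preset_duration = preset_duration
        self.preset_count = preset_count
        self.preset_difference = preset_difference
        self.count = 0                        # counter of correctable errors

    def on_correctable_error(self, now):
        elapsed = now - self.start_time
        if elapsed > self.preset_duration:
            # Window expired without reaching the preset number: restart it.
            self.start_time, self.count, elapsed = now, 0, 0.0
        self.count += 1
        if self.count < self.preset_count:
            return WindowResult(False)
        target = elapsed
        # Acquire attribute information only if the gap is large enough.
        refresh = (self.preset_duration - target) > self.preset_difference
        self.start_time, self.count = now, 0
        return WindowResult(True, target, refresh)
```

With a preset duration of 1.0 s, a preset number of 3, and a preset difference of 0.5 s, three errors arriving within 0.3 s would give a target duration of 0.3 s and request a refresh, while three errors spread over 0.8 s would not.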
It can be understood that the environment attribute information and the device attribute information affect the appropriate preset monitoring duration. The preset monitoring duration is intended to be the optimal setting in terms of overall benefit (including memory capacity benefit and system performance benefit). If the number of memory correctable errors already reaches the preset number after only part of the preset monitoring duration has elapsed, the currently set preset monitoring duration does not match the actual situation. Likewise, if the current preset monitoring duration does not match the predicted monitoring duration corresponding to the current attribute information, the current setting does not maximize the overall benefit. In such cases the preset monitoring duration may be updated so that the overall benefit of the system remains as high as possible. In addition, when the difference between the target duration and the preset monitoring duration is negligible, an update would not change the overall benefit much. The step of obtaining the attribute information is therefore executed only when the difference between the preset monitoring duration and the target duration is large, which avoids needless use of system resources.
Step S102, determining a memory page corresponding to the memory correctable error when the number of the memory correctable errors occurring within the preset monitoring duration reaches a preset number.
The preset number can be determined by weighing the impact of memory page isolation on system performance. For example, the preset number may be 15: if 15 memory correctable errors occur within 1/30 seconds, dynamic offline processing may be performed on the memory pages corresponding to those 15 errors. Note that 15 memory correctable errors within 1/30 seconds both degrades system performance and makes memory uncorrectable errors likely. This example does not limit the preset number. Different preset monitoring durations may correspond to different preset numbers or to the same preset number, which is not limited herein.
It should be noted that, when a correctable memory error occurs, information describing the correctable memory error is reported, where the information may include memory address information used for describing the correctable memory error, and a memory page corresponding to the correctable memory error that occurs may be determined according to the memory address information. The detailed description may refer to related technologies, which are not described herein.
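Mapping reported memory addresses to memory pages can be sketched as below. The sketch assumes that the reported address is a physical address and that pages are 4 KiB (a page shift of 12), which is common but not stated in the disclosure; the function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def page_frame_number(physical_address, page_shift=PAGE_SHIFT):
    """Return the page frame number that contains the given address."""
    return physical_address >> page_shift


def pages_for_errors(addresses, page_shift=PAGE_SHIFT):
    """Return the base address of each distinct page, in first-seen order."""
    seen = {}
    for addr in addresses:
        pfn = addr >> page_shift
        # Several errors on the same page collapse to one entry.
        seen.setdefault(pfn, pfn << page_shift)
    return list(seen.values())
```

Deduplicating by page matters because many correctable errors within one monitoring duration may fall on the same page, in which case only that one page needs to be taken offline.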
Step S103, dynamic off-line processing is carried out on the memory page.
In some embodiments, dynamic offline processing may be performed on the memory page without killing the process that uses it. Here, a process that uses the memory page is a process of a program currently in use, and killing that process means closing the program that uses the memory page. For example, the memory of a process that uses the memory page may be migrated in advance from the current memory page to another memory page, after which dynamic offline processing is performed on the memory page without killing the process.
With this method, when the number of memory correctable errors occurring within the preset monitoring duration reaches the preset number, the memory page corresponding to those errors is determined and dynamic offline processing is performed on it. Memory pages on which a large number of memory correctable errors occur continuously are then no longer used by programs, which avoids the more serious memory uncorrectable errors such pages could cause, avoids system downtime, and keeps the memory space used by programs healthy. In addition, setting a preset monitoring duration prevents a memory page from being dynamically taken offline merely because the preset number of memory correctable errors has accumulated over a long period. This reduces the impact of dynamic offline processing on system performance and on available memory capacity, so both are well maintained.
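The disclosure does not name a specific operating-system interface for this step. On Linux, one existing mechanism with similar behavior is soft offlining, exposed through the sysfs file /sys/devices/system/memory/soft_offline_page: writing a physical address to it asks the kernel to migrate the page's contents and stop using the page, without killing the owning process. The sketch below assumes a Linux kernel built with memory-failure support and a caller running as root; the function name and its return convention are hypothetical.

```python
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def soft_offline_page(page_address, sysfs_path=SOFT_OFFLINE_PATH):
    """Ask the kernel to soft-offline the page containing `page_address`.

    Returns True if the write succeeded, False if the kernel or the
    filesystem rejected it (missing interface, no permission, busy page).
    """
    try:
        with open(sysfs_path, "w") as f:
            # The kernel expects the physical address of the page.
            f.write(f"{page_address:#x}\n")
        return True
    except OSError:
        return False
```

The `sysfs_path` parameter exists only so the function can be exercised against an ordinary file in testing; in normal use the default path would be left unchanged.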
In some embodiments, the memory failure processing method may further include: when the number of memory correctable errors occurring within a preset configuration duration reaches a preset number, switching the reporting mode for memory correctable errors from an interrupt mode to a polling mode.
It should be noted that, in the interrupt mode, each memory correctable error is reported as soon as it occurs, and the server interrupts the program it is processing in order to correct the error. In the polling mode, the memory correctable errors that occurred during a period are reported together at intervals of a certain time period, which reduces the number of times the server interrupts the program being processed and improves system performance.
On this basis, it can be understood that when the number of memory correctable errors occurring in a short time increases, frequently reporting and handling those errors affects system performance and, in turn, the service. With the above method, when the number of memory correctable errors within the preset configuration duration reaches the preset number, the reporting mode for memory correctable errors is switched from the interrupt mode to the polling mode, which reduces the impact on system performance.
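The mode switch can be sketched as a small state holder. This is an illustrative sketch only: the class, enum, and method names are hypothetical, and because the text describes only the switch from interrupt mode to polling mode, the sketch never switches back.

```python
from collections import deque
from enum import Enum


class ReportMode(Enum):
    INTERRUPT = "interrupt"
    POLLING = "polling"


class ReportModeController:
    def __init__(self, config_duration, preset_count):
        self.config_duration = config_duration  # preset configuration duration (s)
        self.preset_count = preset_count        # preset number of errors
        self.mode = ReportMode.INTERRUPT
        self._timestamps = deque()

    def on_correctable_error(self, now):
        """Record one error and return the reporting mode now in effect."""
        self._timestamps.append(now)
        # Keep only errors inside the configuration duration.
        while now - self._timestamps[0] > self.config_duration:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.preset_count:
            self.mode = ReportMode.POLLING  # one-way switch in this sketch
        return self.mode
```

In polling mode, the caller would stop reacting to each individual error and instead collect the errors reported for each polling period in a single pass.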
On the basis, the preset monitoring duration can be determined according to the service scene. For some service scenes with low requirements on system performance, a longer preset monitoring duration can be set, and for some service scenes with high requirements on system performance, a shorter preset monitoring duration can be set, so that unnecessary influence on services caused by system performance is avoided.
On this basis, the service scene can also be used as attribute information, and the predicted monitoring duration is determined by combining it with the environment attribute information and the device attribute information. In this case, the sample attribute information in the training samples of the time prediction model includes service scene attribute information, environment attribute information, and device attribute information, and the time prediction model is trained on these samples.
Based on the same concept, the present disclosure provides a memory failure processing apparatus. Fig. 2 is a block diagram of a memory failure processing apparatus according to an exemplary embodiment of the present disclosure. Referring to fig. 2, the apparatus 200 includes:
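Combining service scene, environment, and device attributes into one model input can be sketched as a simple feature encoding. Everything specific here is an assumption made for illustration: the two scene categories, the batch identifiers passed in by the caller, and the scaling of temperature and humidity by 100 are not taken from the disclosure.

```python
SERVICE_SCENES = ("latency_sensitive", "batch")  # hypothetical categories


def encode_attributes(temperature_c, humidity_pct, batch_id, service_scene,
                      known_batches):
    """Encode attribute information as a flat list of floats.

    Numeric environment values are scaled to roughly [0, 1]; the memory
    particle batch and the service scene are one-hot encoded.
    """
    batch_onehot = [1.0 if batch_id == b else 0.0 for b in known_batches]
    scene_onehot = [1.0 if service_scene == s else 0.0 for s in SERVICE_SCENES]
    return [temperature_c / 100.0, humidity_pct / 100.0] + batch_onehot + scene_onehot
```

For instance, `encode_attributes(25.0, 40.0, "B2", "batch", ("B1", "B2"))` yields a six-element vector that a prediction model could take as input.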
a first determining module 201, configured to determine the number of correctable errors of a memory occurring within a preset monitoring duration;
a second determining module 202, configured to determine a memory page in which a memory correctable error occurs when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
the processing module 203 is configured to perform dynamic offline processing on the memory page.
Optionally, the processing module 203 is specifically configured to perform dynamic offline processing on the memory page without killing a process that utilizes the memory page.
Optionally, the apparatus 200 further comprises:
an acquisition module, configured to acquire attribute information, where the attribute information includes environment attribute information and device attribute information;
the prediction module is used for processing the attribute information according to a pre-trained time prediction model to obtain the predicted monitoring duration;
and the updating module is used for updating the preset monitoring duration according to the predicted monitoring duration.
Optionally, the preset monitoring duration is the length of time from a start time to an end time, and the apparatus 200 further includes:
the initialization module is used for initializing a timer and a counter at the starting time, wherein the timer is used for timing after the starting time, and the counter is used for counting the number of correctable errors of the memory within the preset monitoring duration;
a third determining module, configured to determine a target duration according to the count of the timer and the start time when the number of correctable errors of the memory occurring within the preset monitoring duration reaches a preset number;
the acquisition module is specifically configured to acquire the attribute information when a difference between the preset monitoring duration and the target duration is greater than a preset difference.
Optionally, the environment attribute information includes environment temperature and humidity information, and the device attribute information includes memory particle attribute information.
Optionally, the preset monitoring duration is 1/30 seconds.
Optionally, the apparatus 200 further includes a mode conversion module, configured to switch the reporting mode for memory correctable errors from an interrupt mode to a polling mode when the number of memory correctable errors within a preset configuration duration reaches a preset number.
Based on the same concept, the present disclosure also provides a computer-readable medium, on which a computer program is stored, which, when being executed by a processing device, realizes the steps of the method described in the above-mentioned method embodiments.
Based on the same concept, the present disclosure also provides an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to implement the steps of the method described in the above method embodiments.
Referring now to FIG. 3, a block diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 3 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, the electronic devices may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determine the number of memory correctable errors occurring within a preset monitoring duration; determine the memory page corresponding to the memory correctable errors when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number; and perform dynamic offline processing on the memory page.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not limit the module itself; for example, the first determining module may also be described as "a module for determining the number of memory correctable errors occurring within a preset monitoring duration".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a memory failure handling method according to one or more embodiments of the present disclosure, including:
determining the number of memory correctable errors occurring within a preset monitoring duration;
determining the memory page corresponding to the memory correctable errors when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and performing dynamic offline processing on the memory page.
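The three steps of Example 1 can be sketched as a small window counter. This is a minimal illustration, not the disclosed implementation: the class name `CeMonitor`, the use of page frame numbers, and the choice to restart the window when it expires or when the threshold is hit are all assumptions made for the example. The actual offlining step is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CeMonitor:
    """Counts memory correctable errors (CEs) inside one monitoring window.

    Hypothetical helper for illustration; names and reset policy are assumptions.
    """
    window_s: float   # preset monitoring duration, in seconds
    threshold: int    # preset number of CEs
    start: float = 0.0                               # start time of the current window
    pages: List[int] = field(default_factory=list)   # page frame numbers seen in this window

    def record(self, now: float, pfn: int) -> Optional[List[int]]:
        """Record one CE at time `now` on page `pfn`.

        Returns the distinct pages to take offline once the count reaches
        the threshold within the window, otherwise None.
        """
        if now - self.start > self.window_s:
            # The window has expired: start a new one at this event.
            self.start = now
            self.pages = []
        self.pages.append(pfn)
        if len(self.pages) >= self.threshold:
            hit = sorted(set(self.pages))
            # Start a fresh window after reporting (assumed policy).
            self.start = now
            self.pages = []
            return hit
        return None
```

A caller would feed each reported error into `record` and pass any returned pages to whatever page-offlining mechanism the platform provides.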
Example 2 provides the method of example 1, wherein the performing dynamic offline processing on the memory page includes:
performing dynamic offline processing on the memory page without killing the process that uses the memory page.
Example 3 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure:
acquiring attribute information, wherein the attribute information comprises environment attribute information and equipment attribute information;
processing the attribute information according to a pre-trained time prediction model to obtain a predicted monitoring duration;
and updating the preset monitoring duration according to the predicted monitoring duration.
Example 4 provides the method of example 3, wherein the preset monitoring duration is the length of time from a start time to an end time, and the method further includes:
initializing a timer and a counter at the start time, wherein the timer measures the time elapsed since the start time, and the counter counts the number of memory correctable errors occurring within the preset monitoring duration;
determining a target duration according to the timer reading and the start time when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and executing the step of acquiring the attribute information when the difference between the preset monitoring duration and the target duration is greater than a preset difference.
Example 5 provides the method of example 3, the environment attribute information including ambient temperature and humidity information, and the device attribute information including memory chip (DRAM particle) attribute information, according to one or more embodiments of the present disclosure.
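The trigger condition of Example 4 combined with the update of Example 3 can be sketched as follows. This is an illustration under stated assumptions: the target duration is taken to be the timer reading at the moment the threshold is reached minus the start time, and `read_attributes` and `predict_duration` are hypothetical stand-ins for the attribute acquisition step and the pre-trained time prediction model, neither of which is specified here.

```python
from typing import Callable, Dict


def maybe_update_window(
    preset_s: float,
    start: float,
    threshold_time: float,
    max_gap_s: float,
    read_attributes: Callable[[], Dict[str, float]],
    predict_duration: Callable[[Dict[str, float]], float],
) -> float:
    """Return the monitoring duration to use for the next window.

    preset_s        current preset monitoring duration (seconds)
    start           start time of the window
    threshold_time  timer reading when the CE count reached the preset number
    max_gap_s       preset difference
    """
    # Target duration: how long it actually took to reach the preset count
    # (assumed interpretation of "timer reading and start time").
    target_s = threshold_time - start
    if preset_s - target_s > max_gap_s:
        # The threshold was reached much sooner than the window length:
        # re-acquire attributes and ask the model for a new duration.
        attrs = read_attributes()
        return predict_duration(attrs)
    # Otherwise keep the current preset duration.
    return preset_s
```

Passing the attribute reader and the model as callables keeps the sketch independent of any particular sensor interface or model format.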
Example 6 provides the method of example 1, wherein the preset monitoring duration is 1/30 seconds, in accordance with one or more embodiments of the present disclosure.
Example 7 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure:
when the number of memory correctable errors occurring within a preset configuration duration reaches a preset number, switching the reporting mode for memory correctable errors from an interrupt mode to a polling mode.
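The mode switch of Example 7 amounts to a one-way state transition. The sketch below is illustrative only: the names `ReportMode` and `next_mode` are hypothetical, and since the text describes only the switch from interrupt mode to polling mode, no return path is modeled.

```python
from enum import Enum


class ReportMode(Enum):
    INTERRUPT = "interrupt"
    POLLING = "polling"


def next_mode(mode: ReportMode, ce_count: int, preset_number: int) -> ReportMode:
    """Choose the CE reporting mode after one configuration window.

    ce_count       CEs observed within the preset configuration duration
    preset_number  count at which interrupt-driven reporting is abandoned
    """
    if mode is ReportMode.INTERRUPT and ce_count >= preset_number:
        # Too many CEs arrived in the window: stop taking one interrupt
        # per error and poll for errors instead.
        return ReportMode.POLLING
    return mode
```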
Example 8 provides a memory failure handling apparatus, according to one or more embodiments of the present disclosure, including:
a first determining module, configured to determine the number of memory correctable errors occurring within a preset monitoring duration;
a second determining module, configured to determine the memory page on which the memory correctable errors occurred when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and a processing module, configured to perform dynamic offline processing on the memory page.
Example 9 provides the apparatus of example 8, wherein the processing module is specifically configured to perform dynamic offline processing on the memory page without killing the process that uses the memory page.
Example 10 provides the apparatus of example 8, the apparatus further comprising, in accordance with one or more embodiments of the present disclosure:
an acquisition module, configured to acquire attribute information, wherein the attribute information comprises environment attribute information and device attribute information;
a prediction module, configured to process the attribute information according to a pre-trained time prediction model to obtain a predicted monitoring duration;
and an updating module, configured to update the preset monitoring duration according to the predicted monitoring duration.
Example 11 provides the apparatus of example 10, the preset monitoring duration being the length of time from a start time to an end time, the apparatus further comprising:
an initialization module, configured to initialize a timer and a counter at the start time, wherein the timer measures the time elapsed since the start time, and the counter counts the number of memory correctable errors occurring within the preset monitoring duration;
a third determining module, configured to determine a target duration according to the timer reading and the start time when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
wherein the acquisition module is specifically configured to acquire the attribute information when the difference between the preset monitoring duration and the target duration is greater than a preset difference.
Example 12 provides the apparatus of example 10, the environment attribute information including ambient temperature and humidity information, and the device attribute information including memory chip (DRAM particle) attribute information, according to one or more embodiments of the present disclosure.
Example 13 provides the apparatus of example 8, the preset monitoring duration being 1/30 seconds, in accordance with one or more embodiments of the present disclosure.
Example 14 provides the apparatus of example 8, which further includes a mode conversion module, where the mode conversion module is configured to, when the number of memory correctable errors occurring within a preset configuration duration reaches a preset number, switch the reporting mode for memory correctable errors from an interrupt mode to a polling mode.
Example 15 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the method of any one of examples 1 to 7, in accordance with one or more embodiments of the present disclosure.
Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices to execute the one or more computer programs in the storage device to implement the steps of the method of any of examples 1-7.
The foregoing description is merely an illustration of the preferred embodiments of the disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A memory fault handling method is characterized by comprising the following steps:
determining the number of memory correctable errors occurring within a preset monitoring duration;
determining the memory page corresponding to the memory correctable errors when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and performing dynamic offline processing on the memory page.
2. The method according to claim 1, wherein the performing dynamic offline processing on the memory page includes:
performing dynamic offline processing on the memory page without killing the process that uses the memory page.
3. The method of claim 1, further comprising:
acquiring attribute information, wherein the attribute information comprises environment attribute information and equipment attribute information;
processing the attribute information according to a pre-trained time prediction model to obtain a predicted monitoring duration;
and updating the preset monitoring duration according to the predicted monitoring duration.
4. The method of claim 3, wherein the preset monitoring duration is the length of time from a start time to an end time, and the method further comprises:
initializing a timer and a counter at the start time, wherein the timer measures the time elapsed since the start time, and the counter counts the number of memory correctable errors occurring within the preset monitoring duration;
determining a target duration according to the timer reading and the start time when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and executing the step of acquiring the attribute information when the difference between the preset monitoring duration and the target duration is greater than a preset difference.
5. The method of claim 3, wherein the environment attribute information comprises ambient temperature and humidity information, and the device attribute information comprises memory chip (DRAM particle) attribute information.
6. The method of claim 1, wherein the preset monitoring duration is 1/30 seconds.
7. The method of claim 1, further comprising:
when the number of memory correctable errors occurring within a preset configuration duration reaches a preset number, switching the reporting mode for memory correctable errors from an interrupt mode to a polling mode.
8. A memory fault handling device, comprising:
a first determining module, configured to determine the number of memory correctable errors occurring within a preset monitoring duration;
a second determining module, configured to determine the memory page on which the memory correctable errors occurred when the number of memory correctable errors occurring within the preset monitoring duration reaches a preset number;
and a processing module, configured to perform dynamic offline processing on the memory page.
9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any one of claims 1-7.
CN202111348504.9A 2021-11-15 2021-11-15 Memory fault processing method and device, storage medium and electronic equipment Pending CN114090316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111348504.9A CN114090316A (en) 2021-11-15 2021-11-15 Memory fault processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114090316A true CN114090316A (en) 2022-02-25

Family

ID=80300830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348504.9A Pending CN114090316A (en) 2021-11-15 2021-11-15 Memory fault processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114090316A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445720A (en) * 2016-10-11 2017-02-22 郑州云海信息技术有限公司 Memory error recovery method and device
CN109328340A (en) * 2017-09-30 2019-02-12 华为技术有限公司 Detection method, device and the server of memory failure
CN110046061A (en) * 2019-03-01 2019-07-23 华为技术有限公司 EMS memory error treating method and apparatus
CN111104238A (en) * 2019-10-30 2020-05-05 苏州浪潮智能科技有限公司 CE-based memory diagnosis method, device and medium
CN111221775A (en) * 2018-11-23 2020-06-02 阿里巴巴集团控股有限公司 Processor, cache processing method and electronic equipment
CN112306732A (en) * 2020-11-19 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Automatic error correction control method, device, equipment and medium in server
CN112463492A (en) * 2020-12-04 2021-03-09 苏州浪潮智能科技有限公司 Method, system, equipment and medium for processing correctable errors of memory

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114726713A (en) * 2022-03-02 2022-07-08 阿里巴巴(中国)有限公司 Node fault model training method, node fault model detection equipment, node fault model medium and node fault model product
CN114726713B (en) * 2022-03-02 2024-01-12 阿里巴巴(中国)有限公司 Node fault model training method, node fault model detection method, node fault model training equipment, node fault model medium and node fault model product
WO2023198189A1 (en) * 2022-04-16 2023-10-19 华为技术有限公司 Memory error prediction method and apparatus, and device
WO2024082844A1 (en) * 2022-10-18 2024-04-25 超聚变数字技术有限公司 Fault detection apparatus and detection method for random access memory
CN115629905A (en) * 2022-12-21 2023-01-20 苏州浪潮智能科技有限公司 Memory fault early warning method and device, electronic equipment and readable medium
CN116820828A (en) * 2023-08-29 2023-09-29 苏州浪潮智能科技有限公司 Method and device for setting correctable error threshold, electronic equipment and storage medium
CN116820828B (en) * 2023-08-29 2024-01-09 苏州浪潮智能科技有限公司 Method and device for setting correctable error threshold, electronic equipment and storage medium
CN117076186A (en) * 2023-10-17 2023-11-17 苏州元脑智能科技有限公司 Memory fault detection method, system, device, medium and server
CN117076186B (en) * 2023-10-17 2024-02-09 苏州元脑智能科技有限公司 Memory fault detection method, system, device, medium and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination