CN102369513A - Method for improving stability of computer system and computer system - Google Patents

Method for improving stability of computer system and computer system

Info

Publication number
CN102369513A
Authority
CN
China
Prior art keywords
computer system
equipment
misdata
information
dimm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011800015124A
Other languages
Chinese (zh)
Inventor
张斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • G06F11/1428Reconfiguring to eliminate the error with loss of hardware functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1417Boot up procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The invention relates to a method for improving the stability of a computer system, and to a computer system. The method comprises: collecting error data generated by a device of the computer system when the computer system starts or runs; storing the error data in a non-volatile memory; and, when the computer system restarts, performing state restoration on the device that generated the error data according to the error data. Because the error data is recorded in the non-volatile memory while the computer system runs, and is read back during restart to restore the state of the corresponding device, the method solves the problem that measures such as disabling or isolating a device that is abnormal or predicted to fail become ineffective when device state is initialized after a restart, which directly reduces system stability. The stability of the computer system is thereby improved.

Description

Method for improving stability of computer system and computer system
Technical field
The present invention relates to computer technology, and in particular to a method for improving the stability of a computer system and to a computer system.
Background
High-end fault-tolerant computer systems carry the mission-critical services of industries such as finance, telecommunications, aviation and electric power. They must run without interruption 24 hours a day, 365 days a year, and must guarantee the correctness of data, so they need a high degree of reliability, availability and serviceability (RAS). Specifically, reliability (referred to in this document as stability) requires that the computer can run continuously and can detect and correct system errors automatically. Availability requires that all important resources of the computer system have backups, that potential problems can be detected before they occur, and that the tasks running on an affected resource can be migrated to a backup resource, so that the computer system keeps running normally and downtime is reduced. Serviceability requires that the computer system can be diagnosed online in real time, that the root cause of a problem can be located accurately, and that repairs are fast and accurate.
In the prior art, an onboard administrator (OA) usually collects device error data from the running computer system and uses this error data to predict faults. When the number of failures of a device reaches a set threshold, a standby device is enabled or the device is hot-replaced. This error data has a strong bearing on the stability of the computer system after it restarts, and on the stability of a device that has been taken offline and is later brought online and enabled again.
Summary of the invention
Embodiments of the present invention provide a method for improving the stability of a computer system, and a computer system, so as to improve the stability of the computer system.
An embodiment of the present invention provides a method for improving the stability of a computer system, comprising:
collecting, by the computer system when starting or running, error data generated by a device of the computer system;
storing the error data in a non-volatile memory;
when the computer system restarts, performing state restoration on the device that generated the error data according to the error data.
An embodiment of the present invention further provides a method for improving the stability of a computer system, comprising:
collecting, by the computer system when running, abnormality information of an abnormal device among the devices of the computer system;
storing the abnormality information of the abnormal device in a non-volatile memory;
performing, by the computer system according to the abnormality information, state restoration on the offline device that requests to come online again.
An embodiment of the present invention further provides a computer system, comprising:
an error collection unit, configured to collect error data generated by a device of the computer system when the computer system runs or starts;
a storage unit, configured to store the error data in a non-volatile memory;
a restoration processing unit, configured to perform, when the computer system restarts, state restoration on the device that generated the error data according to the error data.
An embodiment of the present invention further provides a computer system, comprising:
an abnormality information collection unit, configured to collect, when the computer system runs, abnormality information of an abnormal device among the devices of the computer system;
a storage unit, configured to store the abnormality information of the abnormal device in a non-volatile memory;
a state restoration unit, configured to perform, according to the abnormality information, state restoration on the offline device that requests to come online again.
With the method and the computer system provided by the embodiments of the present invention, error data is recorded in the non-volatile memory while the computer system runs, and during restart the error data is read from the non-volatile memory and used to restore the state of the corresponding device in the computer system. This solves the problem that measures taken before the restart, such as disabling or isolating devices that are abnormal or predicted to fail, become ineffective because device state is initialized after the restart, which directly reduces system stability. The stability of the computer system is thereby improved.
Description of drawings
Fig. 1 is a flowchart of a method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of device state restoration when the computer system restarts, in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the BIOS policy configuration menu of the computer system, in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 5 is a flowchart of DIMM isolation restoration in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 6 is a flowchart of processor core disable restoration in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 7 is a flowchart of state restoration for cache disable information in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 8 is a flowchart of state restoration when an abnormal node that has been taken offline is brought online again, in the method for improving the stability of a computer system according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a computer system according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of another computer system according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for improving the stability of a computer system according to an embodiment of the present invention. This embodiment applies to a restarted computer system: the devices in the computer system are restored to their earlier state according to the error data collected during the previous run, so as to improve the stability of the computer system. As shown in Fig. 1, the method includes:
Step 11: When starting or running, the computer system collects error data generated by a device of the computer system. The error data may be abnormality information of a device, isolation information of a dual inline memory module (DIMM), disable information of a processor core, cache disable information, and so on.
Step 12: The error data is stored in a non-volatile memory (NVM).
When the error data is abnormality information of a device, the computer system stores, while running, the abnormality information of the abnormal device in the non-volatile memory.
The method then further includes: the computer system, while running, performs state restoration, according to the abnormality information, on the offline device that requests to come online again.
When the error data is DIMM isolation information, the computer system judges, while running, whether the DIMM has been replaced. If so, the DIMM isolation information stored in the non-volatile memory is cleared; otherwise, the DIMM is isolated when the computer system restarts.
Step 13: When the computer system restarts, state restoration is performed on the device that generated the error data according to the error data.
For example, the computer system isolates the corresponding DIMM in the computer system according to the DIMM isolation information. As another example, according to the disable information of a processor core, the computer system prevents the corresponding processor core from taking part in the selection of the processor bootstrap processor (Processor Boot Strap Processor, PBSP), or disables the corresponding processor core. As a further example, the computer system disables the corresponding cache again according to the cache disable information.
In this embodiment, the computer system stores the error data collected at run time in the NVM and, on restart, performs state restoration on the corresponding devices according to that error data. This prevents the computer system from enabling problematic or unstable devices as normal devices after initialization, and so improves the stability of the computer system.
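Steps 11 to 13 can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the `NvmStore` class, the JSON file standing in for the NVM, and the device names and state strings are all assumptions introduced for the example.

```python
import json
import os
import tempfile


class NvmStore:
    """File-backed stand-in for the non-volatile memory (hypothetical)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # An empty or missing store means no error data was recorded.
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def append(self, record):
        records = self.load()
        records.append(record)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(records, f)


def restore_on_restart(nvm, devices):
    """Step 13: re-apply the recorded state to freshly initialized devices."""
    for record in nvm.load():
        if record["device"] in devices:
            devices[record["device"]] = record["state"]
    return devices


nvm = NvmStore(os.path.join(tempfile.mkdtemp(), "nvm.json"))

# Steps 11 and 12: error data collected at run time is persisted.
nvm.append({"device": "dimm0", "state": "isolated"})
nvm.append({"device": "core3", "state": "disabled"})

# After a restart, initialization marks every device as enabled ...
devices = {"dimm0": "enabled", "dimm1": "enabled", "core3": "enabled"}
# ... and step 13 restores the pre-restart state from the store.
devices = restore_on_restart(nvm, devices)
```

The point of the sketch is the ordering: restoration runs after initialization has reset the device states but before the devices are put to use.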
Fig. 2 is a flowchart of another method for improving the stability of a computer system according to an embodiment of the present invention. This embodiment separately addresses the instability caused when an abnormal device in a running computer system is taken offline and then brought online again. As shown in Fig. 2, the method includes:
Step 21: When running, the computer system collects abnormality information of an abnormal device among the devices of the computer system.
Step 22: The abnormality information of the abnormal device is stored in a non-volatile memory.
Step 23: The computer system performs state restoration, according to the abnormality information, on the offline device that requests to come online again.
After the computer system, while running, stores the abnormality information of the abnormal device in the non-volatile memory, the method may further include: the computer system judges whether the device has been replaced; if so, the abnormality information in the non-volatile memory is deleted; otherwise, the state restoration is performed.
In this embodiment, the computer system performs state restoration on the offline device that requests to come online again according to the abnormality information stored in the NVM. This prevents a device that went offline because of an abnormality from coming back online as a normal device and destabilizing the system, and so improves the stability of the computer system.
For example, while the computer system is running, if one of its devices fails or is disabled by the system, the basic input/output system (BIOS) can save this information in the NVM. When the computer system restarts, the information is analyzed and processed: before the unstable device is started, the system is configured according to the information so that the unstable device is isolated or disabled. The unstable device is thus returned to the state it had before the restart, which guarantees the stability of the computer system.
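The replaced-or-not decision described above can be sketched as follows. The text does not say how replacement is detected; comparing a stored serial number with the serial number of the device now in the slot is purely an assumption made for this example, as are the function name and record layout.

```python
def on_reonline_request(slot, current_serial, nvm):
    """Decide what to do when the device in `slot` asks to come online again.

    nvm maps slot -> {"serial": ..., "state": ...} and stands in for the
    non-volatile memory (hypothetical layout).
    """
    record = nvm.get(slot)
    if record is None:
        # No abnormality information was stored: bring the device online as is.
        return "online"
    if record["serial"] != current_serial:
        # Assumed replacement check: a different part is now in the slot,
        # so the stale abnormality information is deleted.
        del nvm[slot]
        return "online"
    # Same device as before: restore the recorded state.
    return "restore:" + record["state"]


nvm = {"node2": {"serial": "A100", "state": "disabled"}}
same_device = on_reonline_request("node2", "A100", nvm)
replaced_device = on_reonline_request("node2", "B200", nvm)
```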
Fig. 3 is a schematic diagram of device state restoration when the computer system restarts, in the method for improving the stability of a computer system according to an embodiment of the present invention. As shown in Fig. 3, while the computer system runs, the collected error data, such as DIMM isolation information, processor core health information and cache disable information, is stored in the NVM. When the computer system restarts with the BIOS configuration policy turned on, that is, with the device state restoration function enabled in the BIOS configuration interface, the error data is obtained from the NVM, and the device state restoration dispatch program is invoked according to that error data to restore the state of the corresponding devices.
Further, the BIOS configuration interface may also provide switches, so that the user can restore device state flexibly as required. As shown in Fig. 4, the configuration policy switches of different devices are turned on individually as required, and only devices whose switch is on are restored according to the corresponding policy. At the start of a restart, when system initialization is performed, the system enters the device state restoration dispatch program.
Taking the DIMM isolation information, processor core health information and cache disable information shown in Fig. 4 as examples, the method for improving the stability of a computer system is described in further detail below.
After this error data is generated, the BIOS saves it in the non-volatile memory. When the computer system restarts and invokes the device state restoration dispatch program during initialization, the error data is obtained from the NVM and handled according to the different policies.
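The per-device switches can be pictured as a small policy object consulted by the dispatch step. The sketch below is an assumption-laden illustration: the `RestorePolicy` fields, the `kind` key and the default switch values are invented for the example and are not taken from the BIOS menu in Fig. 4.

```python
from dataclasses import dataclass


@dataclass
class RestorePolicy:
    """Hypothetical per-device-class switches from the BIOS configuration menu."""
    dimm_isolation: bool = True
    core_disable: bool = True
    cache_disable: bool = False


def dispatch(policy, error_records):
    """Return only the records whose device class has its switch turned on."""
    switches = {
        "dimm": policy.dimm_isolation,
        "core": policy.core_disable,
        "cache": policy.cache_disable,
    }
    # Unknown device classes are skipped rather than restored blindly.
    return [r for r in error_records if switches.get(r["kind"], False)]


records = [
    {"kind": "dimm", "id": 2},
    {"kind": "core", "id": 5},
    {"kind": "cache", "id": 17},
]
selected = dispatch(RestorePolicy(), records)
```

With the assumed defaults, the cache record is left untouched because its switch is off.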
Specifically, the DIMM isolation restoration flow is shown in Fig. 5 and includes:
Step 51: While the computer system is running, the number of error checking and correction (Error Correcting Code, ECC) errors of a DIMM reaches a specified threshold.
Step 52: The computer system identifies the error source from the log information.
Step 53: The computer system marks the DIMM as about to fail and disables it.
Step 54: The computer system saves the information about the failed DIMM in the NVM.
Step 55: The computer system judges whether the DIMM has been replaced. If so, step 58 is performed; otherwise, step 56 is performed.
Step 56: After restarting, the computer system reads the DIMM information stored in the NVM.
Step 57: The computer system invokes the device state restoration dispatch program and isolates the corresponding DIMM; the computer system then continues to start and boots the operating system (OS).
Step 58: The computer system clears the information about the failed DIMM from the NVM, and the computer system returns to a healthy state.
In this embodiment, the DIMMs in the computer system protect their data through mechanisms such as ECC. When the number of ECC errors is found to reach the preset upper threshold, the system marks the DIMM as failed, disables it, and saves the failure information in the NVM. If the DIMM is replaced through the hot-plug procedure in the period between that point and the next restart, during which the computer system keeps running normally, the failure information of this DIMM in the NVM is cleared. Otherwise, when the system restarts, the device state restoration dispatch program is invoked during startup, the failure information of all DIMMs in the system is read, and a configuration program is used to isolate them.
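The DIMM flow of Fig. 5 can be sketched as a small tracker. The threshold value, the class and method names, and the in-memory set standing in for the NVM are assumptions for illustration only.

```python
ECC_THRESHOLD = 3  # assumed value; the text only says "a specified threshold"


class DimmTracker:
    """Tracks ECC errors per DIMM slot and the failed-DIMM list kept in NVM."""

    def __init__(self):
        self.ecc_errors = {}     # slot -> ECC error count in the current run
        self.nvm_failed = set()  # stand-in for failure records stored in NVM

    def report_ecc_error(self, slot):
        """Steps 51 to 54: count the error; at the threshold, record the DIMM."""
        self.ecc_errors[slot] = self.ecc_errors.get(slot, 0) + 1
        if self.ecc_errors[slot] >= ECC_THRESHOLD:
            self.nvm_failed.add(slot)

    def hot_replace(self, slot):
        """Steps 55 and 58: a replaced DIMM has its failure record cleared."""
        self.nvm_failed.discard(slot)
        self.ecc_errors.pop(slot, None)

    def slots_to_isolate_on_restart(self):
        """Steps 56 and 57: DIMMs still recorded as failed are isolated."""
        return sorted(self.nvm_failed)


tracker = DimmTracker()
for _ in range(3):
    tracker.report_ecc_error(1)
    tracker.report_ecc_error(4)
tracker.hot_replace(4)  # slot 4 is swapped before the restart
to_isolate = tracker.slots_to_isolate_on_restart()
```

Only slot 1 remains in the list, since slot 4 was replaced before the restart.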
The processor core disable restoration flow is shown in Fig. 6 and includes:
Step 61: When the computer system starts for the first time, it checks the health of the processor cores.
Step 62: The computer system disables the processor cores that fail the check.
Step 63: The computer system records this disable information in the NVM.
Step 64: PBSP selection is performed; the computer system continues to start and boots the OS.
Step 65: When the computer system starts for the second time, it invokes the device state restoration dispatch program and reads the disable information from the NVM.
Step 66: According to the disable information read, the computer system prevents the corresponding processor cores from taking part in PBSP selection, or disables the corresponding processor cores directly.
Step 67: The computer system continues to start and checks the health of the processor cores.
Step 68: The computer system disables the processor cores that fail the check.
Step 69: The computer system records the disable information in the NVM.
Afterwards, the computer system performs PBSP selection, continues to start, and boots the OS.
Some unstable cores may be detected during system startup. Such a core is masked during the current startup, but at the next startup it may still pass the check and go on to compete for PBSP and SBSP and become the system's main thread; if the core then develops a problem, the whole system crashes immediately. In this flow, information about the detected unstable cores is stored in the NVM, and at the next startup the device state restoration flow identifies the unstable cores and either prevents them from competing for PBSP or disables them directly.
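The two-boot behaviour of Fig. 6 can be sketched as below. The `boot` function, the boolean health map and the rule that the lowest-numbered healthy core wins PBSP selection are assumptions for the example; the text does not describe how the selection itself is made.

```python
def boot(core_health, nvm_disabled):
    """Simulate one startup pass.

    core_health:  core id -> True if the core passes this boot's health check.
    nvm_disabled: set of core ids persisted across boots (stand-in for NVM).
    Returns (pbsp, enabled_cores).
    """
    # Steps 65 and 66: cores recorded on an earlier boot are excluded up front,
    # even if they would pass the health check this time.
    candidates = [c for c in sorted(core_health) if c not in nvm_disabled]

    # Steps 61 to 63 (and 67 to 69): check health, disable and record failures.
    passed = []
    for core in candidates:
        if core_health[core]:
            passed.append(core)
        else:
            nvm_disabled.add(core)

    # Step 64: PBSP selection; picking the lowest id is an assumed rule.
    pbsp = passed[0] if passed else None
    return pbsp, passed


nvm_disabled = set()
# First boot: core 0 fails the health check and is recorded.
first = boot({0: False, 1: True, 2: True}, nvm_disabled)
# Second boot: core 0 would pass now, but it stays out of PBSP selection.
second = boot({0: True, 1: True, 2: True}, nvm_disabled)
```

Without the persisted set, core 0 would win PBSP selection on the second boot, which is exactly the case the flow is meant to prevent.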
The state restoration flow for cache disable information is shown in Fig. 7 and includes:
Step 71: In the computer system, when the number of ECC errors on a cache line (Cacheline) reaches a threshold, the computer system disables that cache line.
Step 72: The computer system records this disable information in the NVM.
Step 73: When the computer system restarts, it invokes the device state restoration dispatch program and reads the cache disable information.
Step 74: The computer system uses a system service to reconfigure the cache.
Afterwards, the computer system continues to start and boots the OS.
Cache safety technology specifies that a cache line whose ECC error count reaches a threshold is disabled by the system. This flow saves that information in the NVM. At the next startup of the computer system, the device state restoration dispatch program is invoked to read the information and configure the cache lines, so that cache lines previously marked as disabled are closed again.
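A rough sketch of the cache line flow in Fig. 7 follows. The threshold value, the function names and the use of line addresses as identifiers are assumptions; a real implementation would configure hardware through a system service rather than return a dictionary.

```python
from collections import Counter

CACHELINE_ECC_THRESHOLD = 2  # assumed value; the text only says "a threshold"


def record_errors(events, nvm_disabled):
    """Steps 71 and 72: record cache lines whose ECC error count hit the threshold."""
    for line, count in Counter(events).items():
        if count >= CACHELINE_ECC_THRESHOLD:
            nvm_disabled.add(line)


def configure_cache_on_restart(all_lines, nvm_disabled):
    """Steps 73 and 74: map each line to True (enabled) or False (closed again)."""
    return {line: line not in nvm_disabled for line in all_lines}


nvm_disabled = set()  # stand-in for the disable information kept in NVM
record_errors([0x40, 0x80, 0x40], nvm_disabled)
config = configure_cache_on_restart([0x00, 0x40, 0x80], nvm_disabled)
```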
Abnormality information about devices in the running computer system is always saved in the NVM. Consider the following scenario: for energy saving or some other reason, the user requests that a node in the running computer system on which an abnormality has occurred be taken offline. After the computer system completes the offline operation, the user, because of service needs, requests that the offline node be brought online again. At this point the system invokes the device state restoration program to restore the node state, as shown in Fig. 8, which includes:
Step 81: When a device in the running computer system becomes abnormal, the abnormal condition is handled.
Step 82: The abnormality information is saved in the NVM.
Step 83: The user requests that a node or device be taken offline.
Step 84: The offline operation completes, and the computer system continues to run.
Step 85: The user requests that the offline node or device be brought online again.
Step 86: During the online operation, the computer system invokes the device state restoration dispatch program and performs state restoration on each device.
Step 87: The device comes online.
Step 88: The computer system continues to run.
This restoration is optional and can be configured for the online operation through the switches provided in the BIOS configuration interface shown in Fig. 4.
The biggest difference between this flow and the flows above is that it does not involve a restart of the computer system; the whole operation is carried out dynamically.
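The re-online flow of Fig. 8 can be sketched as follows. The node layout, the state strings and the `restore_enabled` flag (standing in for the optional BIOS switch) are assumptions introduced for this example.

```python
def bring_online(node, nvm_abnormal, restore_enabled=True):
    """Steps 85 to 87: bring an offline node back without restarting the system.

    node:         {"name": ..., "devices": [...]} (hypothetical layout)
    nvm_abnormal: node name -> list of devices recorded as abnormal in NVM
    """
    # Coming online initializes every device on the node as enabled ...
    devices = {dev: "enabled" for dev in node["devices"]}
    if restore_enabled:
        # ... and step 86 re-applies the state recorded before going offline.
        for dev in nvm_abnormal.get(node["name"], []):
            if dev in devices:
                devices[dev] = "disabled"
    return devices


node = {"name": "node1", "devices": ["cpu0", "cpu1", "dimm0"]}
nvm_abnormal = {"node1": ["cpu1"]}  # written in steps 81 and 82

restored = bring_online(node, nvm_abnormal)
unrestored = bring_online(node, nvm_abnormal, restore_enabled=False)
```

With the switch off, `cpu1` returns as an ordinary enabled device, which is the behaviour the flow is designed to avoid.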
In the foregoing method embodiments, error data is recorded in the NVM while the computer system runs, and during restart the error data is read from the NVM and used to restore the state of the corresponding devices in the computer system. This solves the problem that measures taken before the restart, such as disabling or isolating devices that are abnormal or predicted to fail, become ineffective because device state is initialized after the restart, which directly reduces system stability. The stability of the computer system is thereby improved. In addition, abnormality information is saved in the NVM during the offline and online procedures of a device in the running computer system, and when the device comes online again it is restored according to the abnormality information saved in the NVM. This solves the following problem: while the computer system is running, the user requests that a device be taken offline and then brought online again; the state the device had before going offline may then be initialized, so that the disabled part of the device (which may be a sub-device) runs in the system again and system stability declines. The stability of the computer system is thereby improved.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by program instructions and related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Fig. 9 is a schematic structural diagram of a computer system according to an embodiment of the present invention. As shown in Fig. 9, the computer system includes an error collection unit 91, a storage unit 92 and a restoration processing unit 93.
The error collection unit 91 is configured to collect error data generated by a device of the computer system when the computer system runs or starts. The storage unit 92 is configured to store the error data in the NVM. The restoration processing unit 93 is configured to perform, when the computer system restarts, state restoration on the device that generated the error data according to the error data.
When the error data is abnormality information of the device, the restoration processing unit 93 is further configured to perform, while the computer system runs and according to the abnormality information, state restoration on the offline device that requests to come online again.
As another example, the restoration processing unit 93 is specifically configured to isolate the corresponding DIMM in the computer system according to DIMM isolation information.
The computer system may further include a replacement judging unit and an information clearing unit.
The replacement judging unit is configured to judge, while the computer system runs, whether the DIMM has been replaced. The information clearing unit is configured to clear the DIMM isolation information stored in the non-volatile memory if the replacement judging unit judges that the DIMM has been replaced. Correspondingly, the restoration processing unit is configured to isolate the DIMM when the computer system restarts if the replacement judging unit judges that the DIMM has not been replaced.
As a further example, the restoration processing unit 93 may be specifically configured to prevent, according to the disable information of a processor core, the corresponding processor core in the computer system from taking part in PBSP selection, or to disable the corresponding processor core in the computer system.
As yet another example, the restoration processing unit 93 may be specifically configured to close the corresponding cache in the computer system again according to cache disable information. For details of the error data collected by the error collection unit and of the restoration performed by the restoration processing unit, see the description in the foregoing method embodiments.
In this embodiment, the computer system stores the error data collected at run time by the error collection unit in the NVM and, on restart, performs state restoration on the corresponding devices through the restoration processing unit according to that error data. This prevents the computer system from enabling problematic or unstable devices as normal devices after initialization, and so improves the stability of the computer system.
Fig. 10 is a schematic structural diagram of another computer system according to an embodiment of the present invention. As shown in Fig. 10, the computer system includes an abnormality information collection unit 101, a storage unit 102 and a state restoration unit 103.
The abnormality information collection unit 101 is configured to collect, when the computer system runs, abnormality information of an abnormal device among the devices of the computer system. The storage unit 102 is configured to store the abnormality information of the abnormal device in the NVM. The state restoration unit 103 is configured to perform, according to the abnormality information, state restoration on the offline device that requests to come online again.
The computer system provided by this embodiment of the present invention may further include a replacement judging unit and an information deletion unit.
The replacement judging unit is configured to judge whether the device has been replaced after the abnormality information of the abnormal device has been stored in the non-volatile memory while the computer system runs. The information deletion unit is configured to delete the abnormality information from the non-volatile memory if the replacement judging unit judges that the device has been replaced. Correspondingly, the state restoration unit is configured to perform the state restoration if the replacement judging unit judges that the device has not been replaced.
In this embodiment, the computer system performs, through the state restoration unit, state restoration on the offline device that requests to come online again according to the abnormality information stored in the NVM. This prevents a device that went offline because of an abnormality from coming back online as a normal device and destabilizing the system, and so improves the stability of the computer system.
Each unit in the foregoing system embodiments may be provided in the BIOS.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (16)

1. A method for improving the stability of a computer system, characterized by comprising:
collecting, by the computer system during startup or operation, error data produced by a device of the computer system;
storing the error data in a nonvolatile memory; and
when the computer system restarts, performing state recovery processing, according to the error data, on the device that produced the error data.
2. The method for improving the stability of a computer system according to claim 1, characterized in that the error data is abnormal information of the device;
the method further comprises: performing, by the computer system during operation and according to the abnormal information, state recovery on the device that has gone offline and requests to come back online.
3. The method for improving the stability of a computer system according to claim 1 or 2, characterized in that the error data is dual in-line memory module (DIMM) isolation information;
performing state recovery processing on the corresponding device in the computer system according to the error data comprises:
isolating the corresponding DIMM in the computer system according to the DIMM isolation information.
4. The method for improving the stability of a computer system according to claim 3, characterized in that the computer system judges, during operation, whether the DIMM has been replaced; if so, the DIMM isolation information stored in the nonvolatile memory is cleared; otherwise, the DIMM is isolated when the computer system restarts.
5. The method for improving the stability of a computer system according to claim 1 or 2, characterized in that the error data is disabling information of a processor core;
performing state recovery processing on the corresponding device in the computer system according to the error data comprises:
according to the disabling information of the processor core, prohibiting the corresponding processor core in the computer system from participating in PBSP selection, or disabling the corresponding processor core in the computer system.
6. The method for improving the stability of a computer system according to claim 1 or 2, characterized in that the error data is cache disabling information;
performing state recovery processing on the corresponding device in the computer system according to the error data comprises:
closing the corresponding cache in the computer system again according to the cache disabling information.
7. A method for improving the stability of a computer system, characterized by comprising:
collecting, by the computer system during operation, abnormal information of a device of the computer system in which an abnormality occurs;
storing the abnormal information of the abnormal device in a nonvolatile memory; and
performing, by the computer system according to the abnormal information, state recovery on the device that has gone offline and requests to come back online.
8. The method for improving the stability of a computer system according to claim 7, characterized in that, after the computer system stores, during operation, the abnormal information of the abnormal device in the nonvolatile memory, the method further comprises: judging, by the computer system, whether the device has been replaced; if so, deleting the abnormal information from the nonvolatile memory; otherwise, performing the state recovery.
9. A computer system, characterized by comprising:
an error collection unit, configured to collect, when the computer system runs or starts, error data produced by a device of the computer system;
a storage unit, configured to store the error data in a nonvolatile memory; and
a recovery processing unit, configured to perform, when the computer system restarts, state recovery processing according to the error data on the device that produced the error data.
10. The computer system according to claim 9, characterized in that the error data is abnormal information of the device;
the recovery processing unit is further configured to perform, while the computer system is running and according to the abnormal information, state recovery on the device that has gone offline and requests to come back online.
11. The computer system according to claim 9 or 10, characterized in that the error data is dual in-line memory module (DIMM) isolation information; the recovery processing unit is specifically configured to isolate the corresponding DIMM in the computer system according to the DIMM isolation information.
12. The computer system according to claim 11, characterized by further comprising:
a replacement judging unit, configured to judge, while the computer system is running, whether the DIMM has been replaced;
an information clearing unit, configured to clear the DIMM isolation information stored in the nonvolatile memory if the replacement judging unit judges that the DIMM has been replaced;
wherein the recovery processing unit is configured to isolate the DIMM when the computer system restarts if the replacement judging unit judges that the DIMM has not been replaced.
13. The computer system according to claim 9 or 10, characterized in that the error data is disabling information of a processor core; the recovery processing unit is specifically configured to, according to the disabling information of the processor core, prohibit the corresponding processor core in the computer system from participating in PBSP selection, or disable the corresponding processor core in the computer system.
14. The computer system according to claim 9 or 10, characterized in that the error data is cache disabling information; the recovery processing unit is specifically configured to close the corresponding cache in the computer system again according to the cache disabling information.
15. A computer system, characterized by comprising:
an abnormal information collection unit, configured to collect, while the computer system is running, abnormal information of a device of the computer system in which an abnormality occurs;
a storage unit, configured to store the abnormal information of the abnormal device in a nonvolatile memory; and
a state recovery unit, configured to perform, according to the abnormal information, state recovery on the device that has gone offline and requests to come back online.
16. The computer system according to claim 15, characterized by further comprising:
a replacement judging unit, configured to judge whether the device has been replaced, after the abnormal information of the device that became abnormal while the computer system was running has been stored in the nonvolatile memory;
an information deletion unit, configured to delete the abnormal information from the nonvolatile memory if the replacement judging unit judges that the device has been replaced;
wherein the state recovery unit is configured to carry out the state recovery if the replacement judging unit judges that the device has not been replaced.
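
As an illustration of the restart flow recited in claims 1, 3, 5 and 6 (collect error data, store it in nonvolatile memory, reapply it after a restart), the following Python sketch may help. It is only a sketch under stated assumptions: a JSON file stands in for the nonvolatile memory, and the record layout (`kind`, `device`) and all function names are hypothetical and do not come from the patent.

```python
import json
import os
import tempfile


def save_error_data(path, error_data):
    # Storage step: persist the collected error data (a file stands in for the NVM).
    with open(path, "w") as f:
        json.dump(error_data, f)


def load_error_data(path):
    # After a restart, read back whatever error data was persisted earlier.
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


def restore_on_restart(error_data, dimms, cores, caches):
    """Reapply recorded error data; return the devices that remain usable."""
    for entry in error_data:
        kind, device = entry["kind"], entry["device"]
        if kind == "dimm_isolation":
            dimms.discard(device)    # isolate the recorded DIMM again
        elif kind == "core_disable":
            cores.discard(device)    # core is disabled / excluded from PBSP selection
        elif kind == "cache_disable":
            caches.discard(device)   # close the recorded cache again
    return dimms, cores, caches


def demo():
    # Simulate one run that records faults, followed by a restart that restores state.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nvm.json")
        save_error_data(path, [
            {"kind": "dimm_isolation", "device": "DIMM1"},
            {"kind": "core_disable", "device": "core2"},
        ])
        return restore_on_restart(
            load_error_data(path), {"DIMM0", "DIMM1"}, {"core0", "core2"}, {"L2"}
        )
```

In this sketch, the devices recorded as faulty before the restart are removed from the usable sets afterwards, so they are not brought up again as if they were healthy.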
CN2011800015124A 2011-08-31 2011-08-31 Method for improving stability of computer system and computer system Pending CN102369513A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/079198 WO2012119432A1 (en) 2011-08-31 2011-08-31 Method for improving stability of computer system, and computer system

Publications (1)

Publication Number Publication Date
CN102369513A true CN102369513A (en) 2012-03-07

Family

ID=45761447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011800015124A Pending CN102369513A (en) 2011-08-31 2011-08-31 Method for improving stability of computer system and computer system

Country Status (2)

Country Link
CN (1) CN102369513A (en)
WO (1) WO2012119432A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514068A (en) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating internal storage faults
CN104346265A (en) * 2013-07-29 2015-02-11 比亚迪股份有限公司 Terminal equipment and acquisition method and device for log information thereof
CN104809051A (en) * 2014-01-28 2015-07-29 国际商业机器公司 Method and device for forecasting anomalies and breakdown in computer application
CN106598790A (en) * 2015-10-16 2017-04-26 中兴通讯股份有限公司 Server hardware failure detection method, apparatus of server, and server
CN107077303A (en) * 2014-12-22 2017-08-18 英特尔公司 Distribution and configuration long-time memory
CN110955569A (en) * 2019-11-26 2020-04-03 英业达科技有限公司 Method, system, medium, and apparatus for testing dual inline memory module
CN111913825A (en) * 2020-07-31 2020-11-10 赵鑫飚 Big data based solution recommendation system and method
CN112732477A (en) * 2021-04-01 2021-04-30 四川华鲲振宇智能科技有限责任公司 Method for fault isolation by out-of-band self-checking

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101099135A (en) * 2004-12-03 2008-01-02 英特尔公司 Prevention of data loss due to power failure
CN101126995A (en) * 2006-08-14 2008-02-20 国际商业机器公司 Method and apparatus for processing serious hardware error

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514068A (en) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating internal storage faults
CN104346265A (en) * 2013-07-29 2015-02-11 比亚迪股份有限公司 Terminal equipment and acquisition method and device for log information thereof
CN104346265B (en) * 2013-07-29 2018-03-27 比亚迪股份有限公司 The acquisition methods and device of terminal device and its log information
CN104809051A (en) * 2014-01-28 2015-07-29 国际商业机器公司 Method and device for forecasting anomalies and breakdown in computer application
CN104809051B (en) * 2014-01-28 2017-11-14 国际商业机器公司 Method and apparatus for predicting exception and failure in computer application
US9823954B2 (en) 2014-01-28 2017-11-21 International Business Machines Corporation Predicting anomalies and incidents in a computer application
CN107077303B (en) * 2014-12-22 2022-11-15 英特尔公司 Allocating and configuring persistent memory
CN107077303A (en) * 2014-12-22 2017-08-18 英特尔公司 Distribution and configuration long-time memory
CN106598790A (en) * 2015-10-16 2017-04-26 中兴通讯股份有限公司 Server hardware failure detection method, apparatus of server, and server
CN110955569B (en) * 2019-11-26 2021-10-01 英业达科技有限公司 Method, system, medium, and apparatus for testing dual inline memory module
CN110955569A (en) * 2019-11-26 2020-04-03 英业达科技有限公司 Method, system, medium, and apparatus for testing dual inline memory module
CN111913825A (en) * 2020-07-31 2020-11-10 赵鑫飚 Big data based solution recommendation system and method
CN111913825B (en) * 2020-07-31 2021-04-27 山西泰森科技股份有限公司 Big data based solution recommendation system and method
CN112732477A (en) * 2021-04-01 2021-04-30 四川华鲲振宇智能科技有限责任公司 Method for fault isolation by out-of-band self-checking

Also Published As

Publication number Publication date
WO2012119432A1 (en) 2012-09-13

Similar Documents

Publication Publication Date Title
CN102369513A (en) Method for improving stability of computer system and computer system
CN101221508B (en) Equipment starting method and device
CN102298545B (en) System startup boot processing method and device
JP6326745B2 (en) Battery control device, battery charge capacity diagnosis method, and battery charge capacity diagnosis program
US9244773B2 (en) Apparatus and method for handling abnormalities occurring during startup
CN110704287B (en) RAID card abnormal log collection method and system under Linux system and storage medium
CN102915260B (en) The method that solid state hard disc is fault-tolerant and solid state hard disc thereof
JP2011170589A (en) Storage control device, storage device, and storage control method
US11726873B2 (en) Handling memory errors identified by microprocessors
CN111240903A (en) Data recovery method and related equipment
JP6880961B2 (en) Information processing device and log recording method
KR101258589B1 (en) Information storage medium recording data according to journaling file system, method and apparatus of writing/recovering data using journaling file system
US11537468B1 (en) Recording memory errors for use after restarts
US20100169572A1 (en) Data storage method, apparatus and system for interrupted write recovery
JP5910413B2 (en) Information processing apparatus, activation program, and activation method
CN111813748B (en) File system mounting method and device, electronic equipment and storage medium
US20100162082A1 (en) Control device, storage apparatus and controlling method
CN112486720A (en) Method for improving stability of computer system and computer system
US8108740B2 (en) Method for operating a memory device
CN118051383A (en) Partition damage switching backup method and system
CN108920210A (en) A kind of method, system and the associated component of load store control software
JP6287055B2 (en) Information processing apparatus, information collection method, and information collection program
US11250929B2 (en) System for detecting computer startup and method of system
CN106874161B (en) Method and device for processing cache exception
JP7180319B2 (en) Information processing device and dump management method for information processing device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120307