A method and device for memory early warning
Technical Field
The present invention relates to the field of server technology, and in particular to a method and device for memory early warning.
Background
With the development of server technology, servers are configured with ever larger memory capacities running at ever higher speeds. Such high-capacity, high-speed memory has become the area where faults that affect system stability occur most often. Predicting and handling a serious memory fault before it occurs has therefore become both an important requirement for system stability and reliability and a technical difficulty.
Currently, a server is provided with a leaky bucket counter, which records the number of correctable errors that occur on each memory module. The BMC polls the leaky bucket counter periodically; when it observes that the number of correctable errors has reached a preset threshold, it triggers an early warning to prompt the server administrator to troubleshoot.
The inventor finds that the existing technology has at least the following problems:
Some memory faults are soft faults (for example, bit flips caused by cosmic rays). A fault of this kind recovers within a certain time and is not an uncorrectable error. Issuing warnings purely by counting ignores the time dimension and leads to false alarms, which in turn makes server operation and maintenance inefficient.
Summary of the Invention
To solve the problems in the prior art, embodiments of the present invention provide a method and device for memory early warning.
The technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a method for memory early warning, the method comprising:
obtaining the currently set warning mode, and obtaining the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter;
recording, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and decrementing the leaky bucket counter according to the leak rate;
when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up, sending an early warning to the baseboard management controller (BMC).
In a first possible implementation of the first aspect, obtaining the warning threshold corresponding to the warning mode comprises:
obtaining the duration parameter and the multiplier parameter corresponding to the warning mode;
calculating the warning threshold corresponding to the warning mode from the duration parameter, the multiplier parameter and the leak rate.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, after calculating the warning threshold corresponding to the warning mode, the method further comprises:
if the calculated warning threshold exceeds an upper limit, setting the warning threshold to the upper limit.
With reference to the first possible implementation of the first aspect, in a third possible implementation of the first aspect, the warning modes include an accuracy mode and a coverage mode;
the leak rate corresponding to the accuracy mode is greater than the leak rate corresponding to the coverage mode;
the duration parameter corresponding to the accuracy mode is greater than the duration parameter corresponding to the coverage mode.
With reference to the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the leak rate is between 1 per second and 100 per second; the duration parameter is between 3 seconds and 60 seconds; and the multiplier parameter is between 10 and 100.
In a fifth possible implementation of the first aspect, sending an early warning to the BMC when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up comprises:
when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold, triggering a system management interrupt (SMI);
when the SMI handler detects that every repair method has been used up, sending an early-warning notification message to the BMC.
In a second aspect, an embodiment of the present invention provides a device for memory early warning, the device comprising:
an acquisition module, configured to obtain the currently set warning mode, and to obtain the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter;
a processing module, configured to record, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and to decrement the leaky bucket counter according to the leak rate;
a warning module, configured to send an early warning to the baseboard management controller (BMC) when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up.
In a first possible implementation of the second aspect, the acquisition module comprises:
an acquiring unit, configured to obtain the duration parameter and the multiplier parameter corresponding to the warning mode;
a computing unit, configured to calculate the warning threshold corresponding to the warning mode from the duration parameter, the multiplier parameter and the leak rate.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the acquisition module further comprises:
a setting unit, configured to set the warning threshold to an upper limit if the calculated warning threshold exceeds the upper limit.
With reference to the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the warning modes include an accuracy mode and a coverage mode;
the leak rate corresponding to the accuracy mode is greater than the leak rate corresponding to the coverage mode;
the duration parameter corresponding to the accuracy mode is greater than the duration parameter corresponding to the coverage mode.
With reference to the first possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the leak rate is between 1 per second and 100 per second; the duration parameter is between 3 seconds and 60 seconds; and the multiplier parameter is between 10 and 100.
In a fifth possible implementation of the second aspect, the warning module comprises:
an interrupt unit, configured to trigger a system management interrupt (SMI) when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold;
a sending unit, configured to send an early-warning notification message to the BMC when the SMI handler detects that every repair method has been used up.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects:
The leak rate and the warning threshold of the current warning mode are obtained; while the leaky bucket counter counts the correctable errors of each memory module, it is decremented at the leak rate; and an early warning is sent to the BMC when the counter reaches the warning threshold and every repair method has been used up. This provides an alarm scheme for correctable memory errors that takes the time dimension into account, improving the accuracy and practical effectiveness of the early warning.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the memory early-warning method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the mathematical model of memory fault early warning used in the memory early-warning method provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of the memory early-warning method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the memory early-warning device provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
An embodiment of the present invention provides a method for memory early warning; see Fig. 1.
The method includes:
101: Obtain the currently set warning mode, and obtain the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter;
102: Record, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and decrement the leaky bucket counter according to the leak rate;
103: When it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up, send an early warning to the BMC (Baseboard Management Controller).
Fig. 2 shows the mathematical model of memory fault early warning. In the model, the horizontal axis represents the frequency at which uncorrectable errors occur, and the vertical axis represents the likelihood that the system suffers a serious fault at that ECC error frequency. A qualitative analysis of memory faults based on the model leads to the following conclusions:
The higher the frequency at which uncorrectable errors occur, the more likely the system is to suffer a serious fault;
The larger the leak-rate threshold chosen for the frequency of correctable errors, the higher the accuracy of the early warning and the lower its coverage.
In this embodiment of the present invention, the leak rate and the warning threshold of the current warning mode are obtained; while the leaky bucket counter counts the correctable errors of each memory module, it is decremented at the leak rate; and an early warning is sent to the BMC when the counter reaches the warning threshold and every repair method has been used up. This provides an alarm scheme for correctable memory errors that takes the time dimension into account, improving the accuracy and practical effectiveness of the early warning.
Embodiment 2
An embodiment of the present invention provides a method for memory early warning; see Fig. 3.
The method includes:
301: Obtain the currently set warning mode, and obtain the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter.
The warning modes include an accuracy mode and a coverage mode.
Memory errors that occur in the accuracy mode need to have the following characteristic: the errors occur at a high frequency and persist for some time. Memory errors that occur in the coverage mode need to have the following characteristic: the leak rate that is set is below a certain threshold.
Each warning mode contains settable parameters, which may include the leak rate and the warning threshold. Optionally, the warning threshold may be calculated from other parameters of the warning mode; in that case, the parameters used to calculate the warning threshold may include a duration parameter and a multiplier parameter. The duration parameter is the length of time over which the memory keeps producing errors: if the duration is too short, it cannot be confirmed whether a fault has occurred, and the longer it is, the more accurate the judgement.
Comparing the accuracy mode with the coverage mode therefore yields the following conclusions:
The leak rate corresponding to the accuracy mode is greater than the leak rate corresponding to the coverage mode;
the higher the accuracy requirement, the greater the leak rate.
The duration parameter corresponding to the accuracy mode is greater than the duration parameter corresponding to the coverage mode.
The leak rate is between 1 per second and 100 per second; the duration parameter is between 3 seconds and 60 seconds; and the multiplier parameter is between 10 and 100.
Because the frequency at which correctable errors actually occur is greater than the leak rate, and to allow for still higher correctable-error frequencies, the warning threshold needs to be amplified 10 to 100 times by the multiplier parameter; the higher the accuracy requirement, the larger the multiplier.
Based on the characteristics of each warning mode, the server administrator can configure the server management settings and select the warning mode that meets current needs.
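The settable parameters and their stated ranges can be sketched as a small configuration type. This is only an illustration: the class name WarningMode, its field names and the two sample configurations are assumptions, not part of the described embodiment. The text fixes the ranges and the ordering between the two modes, not these particular values.

```python
from dataclasses import dataclass

# Ranges quoted in the text above.
LEAK_RATE_RANGE = (1, 100)    # counts subtracted per second (funnel frequency)
DURATION_RANGE = (3, 60)      # seconds
MULTIPLIER_RANGE = (10, 100)  # dimensionless


@dataclass(frozen=True)
class WarningMode:
    """Hypothetical container for the settable parameters of one warning mode."""
    name: str
    leak_rate: int
    duration_s: int
    multiplier: int

    def __post_init__(self):
        # Reject any parameter that falls outside the range given in the text.
        for label, value, (lo, hi) in (
            ("leak_rate", self.leak_rate, LEAK_RATE_RANGE),
            ("duration_s", self.duration_s, DURATION_RANGE),
            ("multiplier", self.multiplier, MULTIPLIER_RANGE),
        ):
            if not lo <= value <= hi:
                raise ValueError(f"{label}={value} outside [{lo}, {hi}]")


# Illustrative values only (assumed); accuracy mode uses the larger
# leak rate and the longer duration, as the comparison above requires.
ACCURACY = WarningMode("accuracy", leak_rate=10, duration_s=50, multiplier=10)
COVERAGE = WarningMode("coverage", leak_rate=1, duration_s=3, multiplier=10)
```

An administrator-facing tool could then select between ACCURACY and COVERAGE, with out-of-range settings rejected at construction time.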
The leaky-bucket counting mechanism currently used for correctable memory errors also has a settable parameter, the leak rate. Once the leak rate is set, the leaky bucket counter, while recording the number of correctable errors that occur in each monitored memory, is also decremented at the leak rate, which reduces the count recorded in it.
The causes of memory faults fall mainly into soft faults (such as bit flips caused by cosmic rays), transient faults (such as crosstalk on data lines) and hard faults (such as a damaged memory chip or a failed bit).
A soft fault may momentarily cause a large number of memory errors to be reported, but it does not persist: memory errors of this kind are corrected quickly and do not recur after a certain time. However, errors caused by soft faults are still recorded in the leaky bucket counter, and if the recorded count reaches the specified threshold an early warning is triggered, which is a false alarm. Decrementing the leaky bucket counter at the leak rate reduces, over time, the count that soft faults left in the counter, so that after a while they no longer trigger an early warning.
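The effect of the leak on soft-fault bursts can be illustrated with a minimal simulation. The burst pattern, the threshold of 1000 and the choice to apply the leak once per second after recording that second's errors are all assumptions made for the sketch; the text does not specify the update order or these numbers.

```python
class LeakyBucketCounter:
    """Minimal model: errors add to the count, each elapsed second subtracts leak_rate."""

    def __init__(self, leak_rate):
        self.leak_rate = leak_rate  # the funnel frequency, in counts per second
        self.count = 0

    def tick(self, new_errors):
        """Advance one second: record new errors, then leak, never going below zero."""
        self.count = max(0, self.count + new_errors - self.leak_rate)
        return self.count


def peak_and_final(errors_per_second, leak_rate):
    """Return the highest count seen and the count left at the end of the trace."""
    bucket = LeakyBucketCounter(leak_rate)
    peak = max(bucket.tick(n) for n in errors_per_second)
    return peak, bucket.count


# Two soft-fault bursts (200 errors/s for 3 s), each followed by 60 quiet seconds.
trace = ([200] * 3 + [0] * 60) * 2
threshold = 1000  # assumed value, chosen only to show the contrast

pure_total = sum(trace)                       # pure counting never decays
peak, final = peak_and_final(trace, leak_rate=10)

print("pure counting:", pure_total, "-> warning" if pure_total >= threshold else "-> quiet")
print("leaky bucket peak:", peak, "-> warning" if peak >= threshold else "-> quiet")
```

With pure counting the two bursts add up and cross the threshold, whereas the leaky bucket drains back to zero between bursts and never reaches it.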
In this embodiment, step 301 may be implemented through the following steps:
3011: Obtain the currently set warning mode;
3012: Obtain the leak rate corresponding to the warning mode;
3013: Obtain the duration parameter and the multiplier parameter corresponding to the warning mode;
3014: Calculate the warning threshold corresponding to the warning mode from the duration parameter, the multiplier parameter and the leak rate.
The warning threshold may be calculated as:
warning threshold = duration parameter × leak rate × multiplier parameter
For example, with leak rate = 10 per second, duration parameter = 50 seconds and multiplier parameter = 10, the warning threshold is 5000.
In addition, for the extreme case in which the accuracy requirement is low and missed warnings must be few:
leak rate = 1 per second; maximum duration = 3 seconds; multiplier parameter = 10; the warning threshold is then 30.
A typical real system demands high accuracy but must also balance this against keeping missed warnings few, so it may be configured as follows:
leak rate = 1 per second; maximum duration = 60 seconds; multiplier parameter = 100; the warning threshold is then 6000.
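The threshold formula and the three configurations quoted above can be checked with a few lines of code. The function name warning_threshold and the dictionary labels are assumptions introduced for illustration.

```python
def warning_threshold(leak_rate, duration_s, multiplier):
    """Warning threshold = duration parameter * leak rate * multiplier parameter."""
    return duration_s * leak_rate * multiplier


# The three configurations given in the text.
examples = {
    "worked example": warning_threshold(leak_rate=10, duration_s=50, multiplier=10),
    "few missed warnings": warning_threshold(leak_rate=1, duration_s=3, multiplier=10),
    "balanced": warning_threshold(leak_rate=1, duration_s=60, multiplier=100),
}

for label, value in examples.items():
    print(f"{label}: {value}")
```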
Further, in the extreme case where the accuracy requirement is very high, the calculated warning threshold may be excessively large, so an upper limit must be placed on the warning threshold. If the warning threshold is found to exceed the upper limit, it must be reset to the upper limit; when this situation is detected, step 3015 is executed.
3015: If the calculated warning threshold exceeds the upper limit, set the warning threshold to the upper limit.
For example, in the extreme case where the accuracy requirement is very high: leak rate = 100 per second, maximum duration = 60 seconds and multiplier parameter = 100 give a calculated warning threshold of 600000. The upper limit is 32767, so the warning threshold is set to 32767.
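Step 3015 amounts to clamping the calculated value. A minimal sketch, assuming the upper limit of 32767 from the example; the function and constant names are hypothetical.

```python
UPPER_LIMIT = 32767  # upper limit quoted in the example above


def capped_threshold(leak_rate, duration_s, multiplier, upper_limit=UPPER_LIMIT):
    """Compute the warning threshold, then apply step 3015 (clamp to the upper limit)."""
    raw = duration_s * leak_rate * multiplier
    return min(raw, upper_limit)


print(capped_threshold(leak_rate=100, duration_s=60, multiplier=100))  # raw value is 600000
print(capped_threshold(leak_rate=10, duration_s=50, multiplier=10))    # below the limit, unchanged
```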
302: Record, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and decrement the leaky bucket counter according to the leak rate.
For example, if correctable errors occur at 100 per second and the leak rate is 50 per second, the count in the leaky bucket counter increases by 50 every second.
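The net growth in this example, and how long a sustained error rate would take to reach a given threshold, can be worked out as follows. The sketch assumes constant rates and a counter that starts empty; the helper names are hypothetical.

```python
import math


def net_growth_per_second(error_rate, leak_rate):
    """Counts gained per second once the leak is subtracted (never negative)."""
    return max(0, error_rate - leak_rate)


def seconds_to_threshold(error_rate, leak_rate, threshold):
    """Seconds until an initially empty counter reaches the threshold, or None if it never does."""
    growth = net_growth_per_second(error_rate, leak_rate)
    if growth == 0:
        return None
    return math.ceil(threshold / growth)


print(net_growth_per_second(100, 50))       # the example in the text
print(seconds_to_threshold(100, 50, 5000))  # with the threshold of 5000 from the earlier example
print(seconds_to_threshold(50, 50, 5000))   # errors no faster than the leak never trigger
```

This is the sense in which the scheme adds a time dimension: only an error rate that stays above the leak rate for long enough can reach the threshold.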
303: When it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up, send an early warning to the BMC.
When the count recorded in the leaky bucket counter reaches the warning threshold, a RAS (Reliability, Availability, Serviceability) memory repair operation can be started. A RAS memory repair operation can resolve the failure of a faulty memory chip, but at a corresponding cost.
When the count recorded in the leaky bucket counter reaches the warning threshold and the various repair methods have all been used up, the Device Tagging processing function needs to be executed. Device Tagging is the last processing function, available only after the preceding repair functions have all been used up, and it sacrifices the ability to correct single-bit errors.
If Device Tagging has already been performed, all RAS processing actions have been used up, and the SMI interrupt handler sends an early-warning notification to the BMC.
The early warning may be sent as follows:
3031: When it is detected that the count recorded in the leaky bucket counter has reached the warning threshold, trigger an SMI (System Management Interrupt);
3032: When the SMI handler detects that every repair method has been used up, send an early-warning notification message to the BMC.
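The decision flow of steps 3031 and 3032 can be sketched as a single function. Everything here is a simplified model under stated assumptions: repair methods are represented as (name, used_up) pairs, notify_bmc is a stand-in callable rather than a real BMC interface, and the returned strings exist only to make the outcome visible.

```python
def handle_counter_update(count, threshold, repair_methods, notify_bmc):
    """Sketch of steps 3031-3032; returns a short description of what happened.

    repair_methods: list of (name, used_up) pairs, in the order they are tried.
    notify_bmc: hypothetical callable standing in for the early-warning message to the BMC.
    """
    if count < threshold:
        return "no SMI"
    # Step 3031: the threshold is reached, so the SMI handler runs.
    for name, used_up in repair_methods:
        if not used_up:
            # A repair method is still available; use it instead of warning.
            return f"SMI: run repair '{name}'"
    # Step 3032: every repair method is used up, so the BMC is notified.
    notify_bmc("memory early warning")
    return "SMI: early warning sent to BMC"


sent = []
print(handle_counter_update(100, 100, [("RAS repair", True), ("Device Tagging", False)], sent.append))
print(handle_counter_update(100, 100, [("RAS repair", True), ("Device Tagging", True)], sent.append))
```

The first call still has Device Tagging available and so sends nothing; the second finds every method used up and notifies the BMC.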
With the memory early-warning method provided by this embodiment of the present invention, three considerations ensure that the early warnings it issues are accurate and reliable:
a) A high frequency of correctable errors greatly raises the likelihood that an uncorrectable error will occur, so a warning is justified;
b) A high frequency of correctable errors causes frequent error correction, which seriously degrades system performance, so a warning is justified;
c) (Whether Device Tagging is enabled is user-selectable; it is on by default.) Exceeding the threshold triggers Device Tagging, which resolves the failure of a single faulty chip but also causes the memory rank to lose its single-bit error-correction capability, leaving it on the edge of danger, so a warning is justified.
In this embodiment of the present invention, the leak rate and the warning threshold of the current warning mode are obtained; while the leaky bucket counter counts the correctable errors of each memory module, it is decremented at the leak rate; and an early warning is sent to the BMC when the counter reaches the warning threshold and every repair method has been used up. This provides an alarm scheme for correctable memory errors that takes the time dimension into account, improving the accuracy and practical effectiveness of the early warning.
Embodiment 3
An embodiment of the present invention provides a device for memory early warning; see Fig. 4. The device includes:
an acquisition module 401, configured to obtain the currently set warning mode, and to obtain the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter;
a processing module 402, configured to record, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and to decrement the leaky bucket counter according to the leak rate;
a warning module 403, configured to send an early warning to the baseboard management controller (BMC) when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up.
Optionally, the acquisition module 401 includes:
an acquiring unit, configured to obtain the duration parameter and the multiplier parameter corresponding to the warning mode;
a computing unit, configured to calculate the warning threshold corresponding to the warning mode from the duration parameter, the multiplier parameter and the leak rate.
Optionally, the acquisition module 401 further includes:
a setting unit, configured to set the warning threshold to an upper limit if the calculated warning threshold exceeds the upper limit.
Optionally, the warning modes include an accuracy mode and a coverage mode;
the leak rate corresponding to the accuracy mode is greater than the leak rate corresponding to the coverage mode;
the duration parameter corresponding to the accuracy mode is greater than the duration parameter corresponding to the coverage mode.
Optionally, the leak rate is between 1 per second and 100 per second; the duration parameter is between 3 seconds and 60 seconds; and the multiplier parameter is between 10 and 100.
Optionally, the warning module 403 includes:
an interrupt unit, configured to trigger a system management interrupt (SMI) when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold;
a sending unit, configured to send an early-warning notification message to the BMC when the SMI handler detects that every repair method has been used up.
In this embodiment of the present invention, the leak rate and the warning threshold of the current warning mode are obtained; while the leaky bucket counter counts the correctable errors of each memory module, it is decremented at the leak rate; and an early warning is sent to the BMC when the counter reaches the warning threshold and every repair method has been used up. This provides an alarm scheme for correctable memory errors that takes the time dimension into account, improving the accuracy and practical effectiveness of the early warning.
Embodiment 4
An embodiment of the present invention provides a server; for its structure, see Fig. 5. The server includes a memory 501 and at least one processor 502, the processor 502 being configured to perform the following operations:
obtaining the currently set warning mode, and obtaining the leak rate and the warning threshold corresponding to the warning mode, the leak rate (the funnel frequency) being the value subtracted per second from a preset leaky bucket counter;
recording, in the leaky bucket counter, the number of correctable errors that occur in the currently monitored memory, and decrementing the leaky bucket counter according to the leak rate;
when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up, sending an early warning to the baseboard management controller (BMC).
Obtaining the warning threshold corresponding to the warning mode includes:
obtaining the duration parameter and the multiplier parameter corresponding to the warning mode;
calculating the warning threshold corresponding to the warning mode from the duration parameter, the multiplier parameter and the leak rate.
After the warning threshold corresponding to the warning mode is calculated, the operations further include:
if the calculated warning threshold exceeds an upper limit, setting the warning threshold to the upper limit.
The warning modes include an accuracy mode and a coverage mode;
the leak rate corresponding to the accuracy mode is greater than the leak rate corresponding to the coverage mode;
the duration parameter corresponding to the accuracy mode is greater than the duration parameter corresponding to the coverage mode.
The leak rate is between 1 per second and 100 per second; the duration parameter is between 3 seconds and 60 seconds; and the multiplier parameter is between 10 and 100.
Sending an early warning to the BMC when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold and every repair method has been used up includes:
when it is detected that the count recorded in the leaky bucket counter has reached the warning threshold, triggering a system management interrupt (SMI);
when the SMI handler detects that every repair method has been used up, sending an early-warning notification message to the BMC.
In this embodiment of the present invention, the leak rate and the warning threshold of the current warning mode are obtained; while the leaky bucket counter counts the correctable errors of each memory module, it is decremented at the leak rate; and an early warning is sent to the BMC when the counter reaches the warning threshold and every repair method has been used up. This provides an alarm scheme for correctable memory errors that takes the time dimension into account, improving the accuracy and practical effectiveness of the early warning.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.