Background technology
ECC (Error Correcting Code) memory is memory that applies ECC technology. It is used mostly in servers and graphics workstations, where it makes the whole computer system operate more safely and stably.
Before ECC technology appeared, the most widely used technique in memory was parity checking. In digital circuits the smallest unit of data is the bit, which is also the smallest unit in memory; it represents the high and low signal levels of the data as "1" and "0". Eight consecutive bits form a byte. In memory without parity checking, each byte has only 8 bits, and if one of those bits is stored incorrectly, the stored data changes and the application may fail. Memory with parity checking adds one extra bit to each byte (8 bits) for error detection. For example, suppose a byte stores the value (1, 0, 1, 0, 1, 0, 1, 1); its bits sum to 1+0+1+0+1+0+1+1=5. If the sum is odd, the check bit is set to 1 under even parity and to 0 otherwise; under odd parity the rule is reversed. When the CPU reads the stored data back, it sums the 8 data bits again and checks whether the result is consistent with the check bit. When the CPU finds that the two are inconsistent, it attempts to correct the error. The weakness of parity is that when the memory finds that some data bit is wrong, it cannot necessarily determine which bit it is, and therefore cannot necessarily correct the error. The main function of parity memory is thus only to detect errors, although some simple errors can be corrected.
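As a minimal sketch of the parity scheme described above, the following Python fragment computes the even-parity check bit for the example byte and shows that a single flipped bit is detected (without being located), while a second flipped bit goes unnoticed. The function names are illustrative assumptions and do not come from any particular memory controller.

```python
def even_parity_bit(bits):
    """Return the check bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits, check_bit):
    """True if data bits plus check bit contain an even number of 1s."""
    return (sum(bits) + check_bit) % 2 == 0


data = [1, 0, 1, 0, 1, 0, 1, 1]
check = even_parity_bit(data)          # the bits sum to 5 (odd), so the check bit is 1

corrupted = data.copy()
corrupted[3] ^= 1                      # flip one stored bit
single_flip_detected = not parity_ok(corrupted, check)

corrupted[5] ^= 1                      # flip a second stored bit
double_flip_detected = not parity_ok(corrupted, check)
```

The check only reports that the stored byte is inconsistent; it carries no information about which bit changed, which is why parity alone cannot be relied on for correction.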
As the analysis above shows, parity memory adds one check bit to the original data bits to verify the correctness of each 8 bits of data, but as the number of data bits grows, the number of parity check bits grows in proportion: 16 data bits need 2 check bits, 32 data bits need 4, and so on. When the data volume is very large, the probability of data errors is also larger, and parity checking, which can handle only simple errors, no longer meets the need. ECC memory arose in response to this situation. Like parity, ECC is implemented by adding check bits to the original data bits, but the two schemes add them differently, which is why their main functions differ. Unlike parity, ECC needs 5 check bits for 8 data bits, and each time the number of data bits doubles, ECC adds only one more check bit: 6 ECC bits for 16 data bits, 7 for 32, 8 for 64, and so on. In short, ECC memory can tolerate errors and correct them, so that the system continues to run normally without being interrupted by the error. ECC has more advanced automatic detection and correction capability than parity, and can find and correct error bits that parity cannot detect. However, ECC can correct only single-bit errors; when more than one bit is in error, it cannot correct them.
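The check-bit counts quoted above (5, 6, 7 and 8 bits for 8, 16, 32 and 64 data bits) match those of a Hamming code extended with one overall parity bit (SEC-DED). The text does not name the code, so treating it as SEC-DED is an assumption made here for illustration only. Under that assumption, the counts can be reproduced as follows, alongside the one-bit-per-byte parity overhead:

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming (SEC-DED) code: smallest r with 2**r >= m + r + 1, plus one."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1                       # extra bit is the overall parity bit


def parity_check_bits(data_bits):
    """One parity bit per 8 data bits, as described in the text."""
    return data_bits // 8


# width -> (parity check bits, ECC check bits)
overhead = {m: (parity_check_bits(m), secded_check_bits(m)) for m in (8, 16, 32, 64)}
```

At 64 data bits both schemes cost 8 check bits, but only the ECC bits allow a single-bit error to be located and corrected.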
A hardware table entry can be stored in a register inside the forwarding chip, or in RAM (Random Access Memory) that is internal or external to the forwarding chip. It can occupy 32 bits or a multiple of 32 bits, and its role is to direct the forwarding of packets.
When a hardware table entry is written to hardware, a parity or ECC value is generated and written to the device along with it; different devices may use different methods for checking hardware table entries. During packet forwarding, the entry is read and its check value is compared with the original value. If the two differ, a data error is determined to have occurred.
Hardware table entry checking is now widely used in devices. However, parity cannot correct errors and ECC can correct only single-bit errors, and the prior art provides no recovery mechanism for use after a hardware table entry check error is detected. As a result, once such an error is detected, the whole forwarding board must be replaced.
Summary of the invention
The invention provides a method and an apparatus for handling hardware table entry check errors, so as to improve the reliability and maintainability of the system.
To achieve the above object, an embodiment of the invention provides a method for handling hardware table entry check errors, comprising:
obtaining the hardware table entry check error information of a device as recorded in the check information register of that device, wherein the hardware table entry check error information comprises the address of the erroneous hardware table entry and the error count;
when it is determined that the error count of an erroneous hardware table entry in the hardware table entry check error information exceeds a threshold within a first preset time, determining the index of that erroneous hardware table entry according to its address;
querying the corresponding software table entry according to the index of the erroneous hardware table entry, and refreshing the erroneous hardware table entry according to the software table entry found;
when it is determined that the check information register of the device records no new hardware table entry check error information within a second preset time, keeping the device in its working state.
The index of the erroneous hardware table entry is determined from the hardware table entry check error information by the following formula:
$$ i = \lfloor (Ad1 - Ad0) / S \rfloor $$
where $\lfloor * \rfloor$ denotes rounding down, i is the index of the erroneous hardware table entry, Ad0 is the start address of the hardware table entries in memory, Ad1 is the address of the erroneous hardware table entry whose error count exceeds the threshold, and S is the size of a hardware table entry in bytes.
When the faulty device has a reset function, the method further comprises:
when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, triggering a reset of the faulty device; or,
when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, powering down the faulty device and powering it up again.
When the faulty device has a reset function, the method further comprises:
when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, triggering a reset of the faulty device;
when the fault is still not repaired, powering down the faulty device and powering it up again.
The method further comprises:
when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, notifying the user that the device is faulty.
An embodiment of the invention further provides an apparatus for handling hardware table entry check errors, comprising:
an acquisition module, configured to obtain the hardware table entry check error information of a device as recorded in the check information register of that device, wherein the hardware table entry check error information comprises the address of the erroneous hardware table entry and the error count;
a fault detection module, configured to determine whether the error count of an erroneous hardware table entry in the hardware table entry check error information exceeds a threshold within a first preset time;
a fault repair module, configured to: when the fault detection module determines that the error count of an erroneous hardware table entry in the hardware table entry check error information exceeds the threshold within the first preset time, determine the index of that erroneous hardware table entry according to its address; query the corresponding software table entry according to the index of the erroneous hardware table entry, and refresh the erroneous hardware table entry according to the software table entry found; and, when it is determined that the check information register of the device records no new hardware table entry check error information within a second preset time, keep the device in its working state.
The fault repair module is specifically configured to determine the index of the erroneous hardware table entry from the hardware table entry check error information by the following formula:
$$ i = \lfloor (Ad1 - Ad0) / S \rfloor $$
where $\lfloor * \rfloor$ denotes rounding down, i is the index of the erroneous hardware table entry, Ad0 is the start address of the hardware table entries in memory, Ad1 is the address of the erroneous hardware table entry whose error count exceeds the threshold, and S is the size of a hardware table entry in bytes.
When the faulty device has a reset function,
the fault repair module is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information, trigger a reset of the faulty device; or, when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, power down the faulty device and power it up again.
When the faulty device has a reset function,
the fault repair module is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, trigger a reset of the faulty device; and, when the fault is still not repaired, power down the faulty device and power it up again.
The fault repair module is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, notify the user that the device is faulty.
In the above embodiments of the invention, the hardware table entry check error information of a device is obtained from the check information register of that device; when the error count of an erroneous hardware table entry in that information exceeds a threshold within the first preset time, the index of the erroneous hardware table entry is determined according to its address; the corresponding software table entry is queried according to that index, and the erroneous hardware table entry is refreshed according to the software table entry found; the device is then kept in its working state. This improves the reliability and maintainability of the system.
Embodiment
To address the above problems in the prior art, embodiments of the invention provide a technical solution for handling hardware table entry check errors. In this solution, the hardware table entry check error information of a device is obtained from the check information register of that device; when the error count of an erroneous hardware table entry in that information exceeds a threshold within the first preset time, the index of the erroneous hardware table entry is determined according to its address; the corresponding software table entry is queried according to that index, and the erroneous hardware table entry is refreshed according to the software table entry found; and when the check information register of the device records no new hardware table entry check error information within the second preset time, the device is kept in its working state, so as to improve the reliability and maintainability of the system.
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the embodiments of the invention.
Figure 1 is a schematic flowchart of a method for handling hardware table entry check errors provided by an embodiment of the invention. The method can comprise the following steps:
Step 101: obtain hardware table entry check error information, and determine from the obtained information whether the corresponding device is faulty.
Specifically, a device that supports hardware error checking usually has a corresponding check error register that stores the hardware table entry check error information of that device. In embodiments of the invention, the hardware table entry check error information of each device can therefore be obtained from the check error register of that device, and whether the device is faulty can be determined from the obtained information. The hardware table entry check error information recorded in the check error register can include the address of the erroneous hardware table entry, the error count, and similar information.
In practice, the hardware table entry check error information of a device can be obtained from its check error register at regular intervals (periodically), with a certain number of queries forming one polling period. If, within one polling period, the error count of an erroneous hardware table entry exceeds the threshold, the device is determined to be faulty.
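The polling scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: each poll is modelled as a list of (address, error count since the previous poll) pairs, which is a hypothetical register read-out format rather than the interface of any real device, and `faulty_addresses` is an illustrative name.

```python
from collections import Counter


def faulty_addresses(polls, threshold):
    """Sum per-address error counts over one polling period and return the addresses above the threshold."""
    totals = Counter()
    for poll in polls:                     # one element per query of the check error register
        for address, count in poll:
            totals[address] += count
    return sorted(addr for addr, n in totals.items() if n > threshold)


# One polling period made up of three queries (hypothetical data).
period = [
    [(0x40000008, 2)],
    [],
    [(0x40000008, 3), (0x40000100, 1)],
]
suspects = faulty_addresses(period, threshold=3)
```

With a threshold of 3, only address 0x40000008 (5 errors in the period) is reported; the single error at 0x40000100 is treated as transient.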
For a faulty device that affects forwarding (a forwarding chip is taken as the example here), a packet loss rate calculation can be added. The packet loss rate is obtained by dividing the current average error-packet rate (in PPS, packets per second) by the average forwarding rate (PPS) of the forwarding chip. A certain time (for example 1 s) is taken as one detection cycle and the packet loss rate within each detection cycle is measured; if the packet loss rate is higher than a preset threshold in N consecutive detection cycles, the forwarding chip is determined to be faulty, where N is a positive integer.
For example, if within one detection cycle (1 s) the average error-packet rate is 1100 PPS and the average forwarding rate of the forwarding chip is 10000 PPS, the packet loss rate in that detection cycle is 11% (1100/10000 = 11%). If the packet loss rate threshold is 10%, the packet loss rate in that detection cycle is above the threshold. If the consecutive detection window is defined as 8 s, and the packet loss rate obtained in every detection cycle within that window is greater than 10%, the forwarding chip is determined to be faulty.
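The packet loss criterion can be illustrated with the short sketch below. It assumes one sample per detection cycle, given as a pair (error-packet rate in PPS, forwarding rate in PPS); the function names and the sample format are assumptions made for illustration.

```python
def loss_rate(error_pps, forward_pps):
    """Packet loss rate for one detection cycle; 0.0 if nothing was forwarded."""
    return error_pps / forward_pps if forward_pps else 0.0


def chip_faulty(samples, threshold, n):
    """True if the loss rate exceeds the threshold in n consecutive detection cycles."""
    streak = 0
    for error_pps, forward_pps in samples:
        if loss_rate(error_pps, forward_pps) > threshold:
            streak += 1
            if streak >= n:
                return True
        else:
            streak = 0                     # a healthy cycle breaks the run
    return False


# Eight 1 s cycles at 1100 error PPS out of 10000 forwarded PPS (11% each).
bad_window = [(1100, 10000)] * 8
# Same, but the fifth cycle is healthy, so no run of 8 is reached.
interrupted = [(1100, 10000)] * 4 + [(100, 10000)] + [(1100, 10000)] * 3
```

Requiring N consecutive cycles rather than a single one keeps a short burst of errored packets from being mistaken for a chip fault.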
Step 102: when the device is determined to be faulty, perform fault repair on it, and after the repair succeeds, keep the device in its working state.
Specifically, in the prior art the usual response to a detected hardware table entry check error is to replace the device directly. In practice, however, replacement is necessary only for check errors caused by a hardware fault; many hardware table entry check errors caused by other factors (such as cosmic rays) can be repaired by suitable methods.
To improve system reliability and maintainability, in the technical solution provided by embodiments of the invention, once a device is determined to be faulty it can be repaired by the following methods:
Method 1, hardware table entry refresh mechanism
Because a hardware table entry is generally a copy of the corresponding software table entry, when a hardware table entry check error is detected and the device is determined to be faulty, the index of the erroneous hardware table entry can be determined from its address, the corresponding software table entry can be queried by that index, and the hardware table entry can be refreshed according to the software table entry found.
Specifically, in embodiments of the invention, after the device is determined to be faulty, the index of the erroneous entry can be determined from the address of the erroneous hardware table entry contained in the obtained hardware table entry check error information; the corresponding software table entry is then queried by that index, and the hardware table entry is refreshed with the content of the software table entry found, so as to correct the check error of the hardware table entry.
The index of the erroneous hardware table entry can be determined from its address by the following formula:
$$ i = \lfloor (Ad1 - Ad0) / S \rfloor $$
where $\lfloor * \rfloor$ denotes rounding down, i is the index of the erroneous hardware table entry, Ad0 is the start address of the hardware table entries in memory, Ad1 is the address of the erroneous hardware table entry whose error count exceeds the threshold, and S is the size of a hardware table entry in bytes.
For example, suppose a hardware table entry occupies four 32-bit words in memory (that is, one hardware table entry is 16 bytes). As shown in Figure 2, the memory base address of the hardware table entries (that is, their start address in memory) is 0x40000000 and 0x40080000 is the end of the entries; this section of memory is used by the hardware table entries, whose indices are 0, 1, 2, and so on. If the hardware table entry check error information shows that the error count of address 0x40000008 exceeds the threshold, the index of the erroneous hardware table entry is:
$$ i = \lfloor (\mathtt{0x40000008} - \mathtt{0x40000000}) / 16 \rfloor = \lfloor 8 / 16 \rfloor = 0 $$
Thus 0 is the index of the erroneous hardware table entry. The corresponding software table entry is found by this index, and the erroneous hardware table entry is refreshed with the content of that software table entry.
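The index calculation and refresh in this example can be sketched as below, taking the index as the rounded-down quotient of the address offset (Ad1 minus Ad0) and the entry size S, which is what the worked example yields. The two dictionaries stand in for the software table and the hardware table; they, the constant names and the function names are hypothetical and used only for illustration.

```python
BASE_ADDRESS = 0x40000000      # Ad0: start address of the hardware table entries
ENTRY_SIZE = 16                # S: four 32-bit words per entry, in bytes


def entry_index(ad1, ad0=BASE_ADDRESS, size=ENTRY_SIZE):
    """Index of the entry containing address ad1 (integer division rounds down)."""
    return (ad1 - ad0) // size


# Hypothetical tables: the hardware copy of entry 0 has one corrupted byte.
software_table = {0: b"\x11" * 16, 1: b"\x22" * 16}
hardware_table = {0: b"\x11" * 7 + b"\x10" + b"\x11" * 8, 1: b"\x22" * 16}


def refresh_entry(error_address):
    """Overwrite the erroneous hardware entry with its software copy and return its index."""
    i = entry_index(error_address)
    hardware_table[i] = software_table[i]
    return i


refreshed = refresh_entry(0x40000008)
```

Any address from 0x40000000 to 0x4000000F maps to index 0, so the reported error address does not need to be aligned to the start of the entry.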
Method 2, device reset mechanism
A device reset reinitializes some or all of the registers of the device and causes all software table entries to be written to the hardware again. Therefore, when a hardware table entry check error is detected, the hardware table entries of the device can be refreshed by resetting the device, so as to repair the device fault.
Specifically, in embodiments of the invention, if the device provides a reset function, a device reset can be triggered after the device is determined to be faulty, so that some or all of the registers of the device are reinitialized and all software table entries are written again, thereby repairing the check error of the hardware table entry.
Method 3, power-off reset (cold reset) mechanism
When a device is powered down and powered up again, all of its registers are reinitialized and all software table entries are written again. Therefore, when a hardware table entry check error is detected, the hardware table entries of the device can be refreshed by a power-off reset, so as to repair the device fault.
Of the three fault repair methods provided by embodiments of the invention, the hardware table entry refresh mechanism refreshes only the hardware table entry identified as erroneous; the device reset mechanism refreshes all hardware table entries in some or all of the registers by resetting the device; and the power-off reset refreshes all hardware table entries in all registers of the faulty device (and of any other devices sharing a power supply with it). The likelihood of successfully repairing the device fault therefore increases from one method to the next, but so does the impact on service availability.
In embodiments of the invention, when a hardware table entry check error is detected and the device is determined to be faulty, the three methods above can be applied in turn until the fault is repaired. That is, when a device fault occurs, the hardware table entry refresh mechanism is used first; if it fails, the device reset mechanism is used; and if that also fails, the power-off reset mechanism is used. If the device fault is still not repaired after the power-off reset, the user can be notified of the device fault so that the device can be replaced. Whether a repair has succeeded can be determined by whether the check information register of the device records new hardware table entry check error information within a preset time period: if new hardware table entry check error information is recorded within the preset time period, the repair is determined to have failed; if no new hardware table entry check error information is recorded within the preset time period, the repair is determined to have succeeded.
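The escalation just described can be sketched as a loop over the repair methods, from least to most disruptive, with a verification step after each. The toy device model, in which only a reset or power cycle clears the fault, and all function names are assumptions made for illustration; in a real system the verification step would watch the check information register during the preset time period.

```python
def repair(device, methods, verify):
    """Apply each repair method in order; return the name of the first that succeeds, else None."""
    for name, method in methods:
        method(device)
        if verify(device):
            return name
    return None                            # caller should notify the user to replace the device


def refresh(device):
    device["log"].append("refresh")        # in this toy model a refresh does not clear the fault


def reset(device):
    device["log"].append("reset")
    device["new_errors"] = False


def power_cycle(device):
    device["log"].append("power_cycle")
    device["new_errors"] = False


def no_new_errors(device):
    """Stand-in for: no new check error information recorded within the preset time."""
    return not device["new_errors"]


device = {"new_errors": True, "log": []}
steps = [("refresh", refresh), ("reset", reset), ("power_cycle", power_cycle)]
outcome = repair(device, steps, no_new_errors)
```

Because the list of methods is passed in, the same loop covers the variants in which the refresh step is skipped or only the power-off reset is used.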
It should be understood that applying the three fault repair methods in turn is only one specific example of the technical solution provided by embodiments of the invention and does not limit the protection scope of the invention. In the technical solution provided by embodiments of the invention, after the device is determined to be faulty, the second method (device reset mechanism) or the third method (power-off reset mechanism) can also be used directly; alternatively, the second method can be used first and, if it fails, the third method can then be used. Modifications made by those skilled in the art on the basis of the fault repair methods provided by embodiments of the invention without creative effort, and changes to the order in which the methods are used, all fall within the protection scope of the invention.
Further, in embodiments of the invention, in order to preserve service availability as far as possible, if the three repair methods have been applied in turn and the fault is still not repaired, the repair procedure for that fault is not run again for a certain time (which can be determined according to how the service is running).
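One possible way to realise this hold-off is sketched below. The class name, the use of a fault identifier as the key and the explicit timestamp argument are all assumptions made for illustration; the hold-off length would be chosen according to how the service is running.

```python
class RepairGate:
    """Blocks further repair attempts for a fault until a hold-off time has elapsed."""

    def __init__(self, hold_off_seconds):
        self.hold_off = hold_off_seconds
        self.blocked_until = {}            # fault id -> earliest time of the next attempt

    def record_failure(self, fault_id, now):
        """Call after all repair methods have failed for this fault."""
        self.blocked_until[fault_id] = now + self.hold_off

    def may_repair(self, fault_id, now):
        return now >= self.blocked_until.get(fault_id, float("-inf"))


gate = RepairGate(hold_off_seconds=600)
allowed_first = gate.may_repair("chip0", now=0)
gate.record_failure("chip0", now=100)
allowed_during = gate.may_repair("chip0", now=200)
allowed_after = gate.may_repair("chip0", now=700)
```

Passing the current time in explicitly keeps the sketch independent of any particular clock source.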
As the above description shows, in embodiments of the invention, whether a device is faulty is determined by obtaining hardware table entry check error information, and when the device is faulty, fault repair is performed on it, which improves the reliability and maintainability of the system.
Based on the same technical concept as the method embodiment above, an embodiment of the invention further provides an apparatus for handling hardware table entry check errors, which can be applied to the method flow above.
Figure 3 is a schematic structural diagram of an apparatus for handling hardware table entry check errors provided by an embodiment of the invention. The apparatus can comprise:
an acquisition module 31, configured to obtain the hardware table entry check error information of a device as recorded in the check information register of that device, wherein the hardware table entry check error information comprises the address of the erroneous hardware table entry and the error count;
a fault detection module 32, configured to determine whether the error count of an erroneous hardware table entry in the hardware table entry check error information exceeds a threshold within a first preset time;
a fault repair module 33, configured to: when the fault detection module 32 determines that the error count of an erroneous hardware table entry in the hardware table entry check error information exceeds the threshold within the first preset time, determine the index of that erroneous hardware table entry according to its address; query the corresponding software table entry according to the index of the erroneous hardware table entry, and refresh the erroneous hardware table entry according to the software table entry found; and, when it is determined that the check information register of the device records no new hardware table entry check error information within a second preset time, keep the device in its working state.
The fault repair module 33 is specifically configured to determine the index of the erroneous hardware table entry from the hardware table entry check error information by the following formula:
$$ i = \lfloor (Ad1 - Ad0) / S \rfloor $$
where $\lfloor * \rfloor$ denotes rounding down, i is the index of the erroneous hardware table entry, Ad0 is the start address of the hardware table entries in memory, Ad1 is the address of the erroneous hardware table entry whose error count exceeds the threshold, and S is the size of a hardware table entry in bytes.
When the faulty device has a reset function,
the fault repair module 33 is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information, trigger a reset of the faulty device; or, when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, power down the faulty device and power it up again.
When the faulty device has a reset function,
the fault repair module 33 is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, trigger a reset of the faulty device; and, when the fault is still not repaired, power down the faulty device and power it up again.
The fault repair module 33 is further configured to: when it is determined that the check information register of the device records new hardware table entry check error information within the second preset time, notify the user that the device is faulty.
Those skilled in the art will appreciate that the modules of the apparatus in the embodiment can be distributed in the apparatus as described in the embodiment, or can be changed accordingly and located in one or more apparatuses different from that of this embodiment. The modules of the above embodiment can be merged into one module or further split into multiple submodules.
From the above description of the embodiments, those skilled in the art will clearly understand that the invention can be implemented by software together with the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a terminal device (which can be a mobile phone, a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the invention.
The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art can make further improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.