CN109388519A - Error recovery method and device, processor


Info

Publication number
CN109388519A
Authority
CN
China
Prior art keywords
data
error
important
bit
unimportant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710668375.9A
Other languages
Chinese (zh)
Other versions
CN109388519B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110625444.4A priority Critical patent/CN113419899A/en
Priority to CN201710668375.9A priority patent/CN109388519B/en
Publication of CN109388519A publication Critical patent/CN109388519A/en
Application granted granted Critical
Publication of CN109388519B publication Critical patent/CN109388519B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 - Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 - Saving, restoring, recovering or retrying
    • G06F 11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 - Management of the data involved in backup or backup restore
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods


Abstract

An error recovery method and device, and a processor. The error recovery method includes: dividing data in which an error has occurred into important data of M importance levels and unimportant data, where M is a positive integer; performing error recovery on the important data; and, if recovery fails, performing error recovery on both the important data and the unimportant data.

Description

Error recovery method and device, processor
Technical field
The present invention relates to the field of data processing, and more particularly to an error recovery method and device, and a processor.
Background technique
Neural networks and neural network processors have been applied with great success. However, neural network applications involve enormous numbers of parameters and amounts of computation, which places very high demands on the safety and reliability of storage units and computing units.
An error recovery mechanism can help a system recover from an error state when a fault occurs. However, traditional error recovery mechanisms require many recovery cycles, are costly, and severely reduce system throughput, making them unsuitable for high-throughput neural network processors. How to exploit the fault tolerance of neural networks when performing error recovery has therefore become an urgent problem to be solved.
Summary of the invention
In view of the problems with existing schemes, and in order to overcome the shortcomings of the above prior art, the present invention proposes an error recovery method and device, and a processor.
According to an aspect of the invention, an error recovery method is provided, comprising:
dividing data in which an error has occurred into important data of M importance levels and unimportant data; and performing error recovery on the important data and, if recovery fails, performing error recovery on both the important data and the unimportant data, where M is a positive integer.
In some embodiments, the important data of each importance level is divided into important bits and unimportant bits, and performing error recovery on the important data comprises: performing error recovery on the important bits and, if recovery fails, performing error recovery on both the important bits and the unimportant bits of the important data.
In some embodiments, before dividing the data in which an error has occurred into important data of M importance levels and unimportant data, the method comprises: monitoring the working timing of each module in the processor and generating an error signal when an error is found; and locating, according to the error signal, the module in the processor where the error occurred, the pipeline position, and the error type.
In some embodiments, dividing the data in which an error has occurred into important data and unimportant data comprises dividing according to at least one of: the size of the data, the absolute value of the data, the type of the data, the read frequency of the data, and the write frequency of the data.
In some embodiments, dividing the important data of each importance level into important bits and unimportant bits comprises: extracting important bits from the important data of the i-th importance level; if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi - Yi unimportant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
In some embodiments, the Yi bits comprise contiguous bit positions or non-contiguous bit positions.
In some embodiments, the data comprise neural network parameters, and the error recovery method is used for a neural network processor.
According to another aspect of the invention, an error recovery device is provided, comprising: an importance-level division unit, which divides data in which an error has occurred into important data of M importance levels and unimportant data; and an error recovery control unit, which performs error recovery on the important data and, if recovery fails, performs error recovery on both the important data and the unimportant data, where M is a positive integer.
In some embodiments, the importance-level division unit further includes an important-bit division unit for dividing the important data of each importance level into important bits and unimportant bits, and the error recovery control unit performing error recovery on the important data includes performing error recovery on the important bits and, if recovery fails, performing error recovery on both the important bits and the unimportant bits of the important data.
In some embodiments, the error recovery device comprises: an error monitoring unit, which monitors the working timing of each module in the processor and generates an error signal when an error is found; and an error locating unit, which locates, according to the error signal, the module in the processor where the error occurred, the pipeline position, and the error type.
In some embodiments, the importance-level division unit dividing the data in which an error has occurred into important data and unimportant data includes dividing according to at least one of: the size of the data, the absolute value of the data, the type of the data, the read frequency of the data, and the write frequency of the data.
In some embodiments, the importance-level division unit dividing the important data of each importance level into important bits and unimportant bits includes: extracting important bits from the important data of the i-th importance level; if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi - Yi unimportant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
In some embodiments, the Yi bits comprise contiguous bit positions or non-contiguous bit positions.
In some embodiments, the data comprise neural network parameters, and the error recovery device is used for a neural network processor.
According to a further aspect of the invention, a processor is provided, comprising: at least one of a preprocessing module, a DMA, a storage unit, an input cache unit, an instruction control unit, and an arithmetic unit; and at least one error recovery device as described above.
In some embodiments, the error recovery device is connected to at least one of the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, and arithmetic unit, and performs error recovery for errors in at least one of the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, and arithmetic unit.
In some embodiments, there are multiple error recovery devices in one-to-one correspondence with the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, and arithmetic unit, and each error recovery device performs error recovery for errors in the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, or arithmetic unit to which it is connected.
In some embodiments, the error recovery device is embedded in at least one of the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, and arithmetic unit.
In some embodiments, performing error recovery for an error in the arithmetic unit includes re-executing the operation on the parameters.
In some embodiments, the processor comprises a neural network processor; the storage unit is used to store the neurons, weights, and/or instructions of the neural network; the instruction control unit is used to receive the instructions, decode them to generate control information, and control the arithmetic unit; the arithmetic unit is used to receive the weights and neurons, complete the neural network training operation, and send the output neurons back to the storage unit.
It can be seen from the above technical solutions that the present invention has at least the following beneficial effects:
By distinguishing importance levels of the data in which errors occur and designating important bits within each importance level, the important bits are recovered first; if they cannot be recovered, the important data is recovered next; if that still fails, all data is recovered. Applied to a neural network processor, this yields a short recovery cycle and a good power-consumption benefit, making it suitable for high-throughput neural network processors;
The storage unit, instruction control unit, and arithmetic unit of the neural network processor each correspond to an error recovery device, so that the error recovery cycle is further shortened.
Detailed description of the invention
Fig. 1 is a flowchart of an error recovery method in an embodiment of the present invention;
Fig. 2 is a structural block diagram of an error recovery device in another embodiment of the present invention;
Fig. 3 is a structural block diagram of a processor in yet another embodiment of the present invention;
Fig. 4 is a structural block diagram of a processor in a further embodiment of the present invention.
Specific embodiment
Certain embodiments of the invention will now be described more fully with reference to the accompanying drawings, in which some but not all embodiments are shown. Indeed, various embodiments of the invention may be realized in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the disclosure satisfies applicable legal requirements.
In this specification, the following embodiments describing the principles of the invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The description with reference to the drawings is intended to help provide a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes various details to aid understanding, but these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will appreciate that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and structures are omitted for clarity and brevity. Throughout the drawings, the same reference numerals denote the same functions and operations.
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
An embodiment of the invention provides an error recovery method that distinguishes importance levels of the data in which errors occur and designates important bits within each importance level: the important bits are recovered first; if they cannot be recovered, the important data is recovered next; if that still fails, all data is recovered. Applied to a neural network processor, the recovery cycle is short and the power-consumption benefit is good, making the method suitable for high-throughput neural network processors.
Specifically, Fig. 1 shows a flowchart of the error recovery method. As shown in Fig. 1, the error recovery method includes the following steps:
Step S101: monitor the working timing of each module in the processor and generate an error signal when an error is found;
Step S102: locate, according to the error signal, the module in the processor where the error occurred, the pipeline position, and the error type.
The errors include, but are not limited to: uncorrectable errors, correctable errors, non-fatal errors, fatal errors, and other types of errors. An uncorrectable error includes, but is not limited to, an error condition in the function of a hardware interface. A correctable error includes, but is not limited to, an error condition from which the hardware can recover without any information loss. A fatal error includes, but is not limited to, an uncorrectable error condition that renders the hardware unreliable. A non-fatal error includes, but is not limited to, an uncorrectable error that makes a particular task unreliable while the hardware remains fully functional.
Step S103: divide the data in which the error occurred into important data of M importance levels and unimportant data.
The division of data into important and unimportant data, and the division of the important data into importance levels, can be performed according to at least one of the following factors: the size of the parameter, the absolute value of the parameter, the type of the parameter (integer, floating point), the read frequency of the parameter, and the write frequency of the parameter.
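As an illustration of the magnitude-based criterion above, the following Python sketch assigns an importance level by a parameter's absolute value; the thresholds and the function name are hypothetical assumptions, not taken from the patent:

```python
def importance_level(value, num_levels=3, thresholds=(1.0, 0.1)):
    """Assign an importance level (1 = most important) by the absolute
    value of a parameter; thresholds are illustrative only."""
    magnitude = abs(value)
    for level, t in enumerate(thresholds, start=1):
        if magnitude >= t:
            return level
    return num_levels  # smallest-magnitude parameters are least important

# Large-magnitude weights land in level 1, tiny ones in the last level:
print(importance_level(2.5), importance_level(0.3), importance_level(0.01))
# 1 2 3
```

In practice a real classifier could combine several of the listed factors (type, read/write frequency); magnitude alone is used here only to keep the sketch minimal.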
Step S104: divide the important data of each importance level into important bits and unimportant bits.
Specifically, the bits in the data are divided into important bits and unimportant bits. Important bits are extracted from the important data of the i-th importance level: if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi - Yi unimportant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi. The positions of the Yi important bits may be contiguous or non-contiguous.
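The bit split of step S104 can be sketched as follows; the mask-based interface is an illustrative assumption (the patent only requires that Yi important bit positions, contiguous or not, be designated):

```python
def split_bits(value, total_bits, important_mask):
    """Split an integer's bits into important and unimportant parts using
    an arbitrary (possibly non-contiguous) bit mask.
    Returns the pair (important_bits, unimportant_bits)."""
    assert 0 <= value < (1 << total_bits)
    important = value & important_mask
    unimportant = value & ~important_mask & ((1 << total_bits) - 1)
    return important, unimportant

# An 8-bit value with the top 3 bits designated important (Xi = 8, Yi = 3):
imp, unimp = split_bits(0b10110101, 8, 0b11100000)
print(format(imp, "08b"), format(unimp, "08b"))  # 10100000 00010101
```

For floating-point parameters the sign and exponent bits would typically be marked important, which is one natural reading of a non-contiguous mask.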
Step S105: perform error recovery on the erroneous data.
Specifically, error recovery is performed first on the important bits of the data of each importance level. If recovery fails, then, as a second choice, error recovery is performed on both the important bits and the unimportant bits of the important data of each importance level. If recovery still fails, error recovery is performed on both the important data and the unimportant data among the erroneous data.
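The three-tier fallback of step S105 can be sketched as a simple control loop; `try_recover` is a hypothetical callback standing in for whatever recovery mechanism (ECC, backup copies, re-execution) is attached to each scope:

```python
def recover(try_recover):
    """Tiered recovery per step S105: important bits first, then all bits
    of the important data, then all data. `try_recover(scope)` is a
    hypothetical callback that returns True when recovery succeeds."""
    for scope in ("important_bits", "important_data", "all_data"):
        if try_recover(scope):
            return scope  # recovery succeeded at this tier
    return None  # unrecoverable

# Example: suppose only full-data recovery succeeds.
attempts = []
def demo(scope):
    attempts.append(scope)
    return scope == "all_data"

print(recover(demo))  # all_data
print(attempts)       # ['important_bits', 'important_data', 'all_data']
```

The point of the ordering is that the cheap, narrow scope is tried first, so in the common case the recovery cycle stays short.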
The above steps are not all mandatory. In some embodiments, the data may only be divided into important data and unimportant data, without distinguishing important bits from unimportant bits within the important data. In that case, when performing error recovery, error recovery is performed first on the important data; if recovery fails, error recovery is performed on both the important data and the unimportant data among the erroneous data.
In the above embodiments, the data include neural network parameters. The neural network parameters include the neurons and weights of the neural network, as well as the topology of the neural network, and may also include instructions. The error recovery method can be used for a neural network processor.
Another embodiment of the invention provides an error recovery device. Fig. 2 is a structural block diagram of the error recovery device of this embodiment. As shown in Fig. 2, the error recovery device 100 includes a monitoring unit 10, an error locating unit 20, an importance-level division unit 30, and an error recovery control unit 40.
The monitoring unit 10 monitors the working timing of each module in the processor and generates an error signal when an error is found. The error locating unit 20 receives the error signal generated by the monitoring unit 10 and locates, according to the error signal, the module in the processor where the error occurred, the pipeline position, and the error type.
The errors include, but are not limited to: uncorrectable errors, correctable errors, non-fatal errors, fatal errors, and other types of errors. An uncorrectable error includes, but is not limited to, an error condition in the function of a hardware interface. A correctable error includes, but is not limited to, an error condition from which the hardware can recover without any information loss. A fatal error includes, but is not limited to, an uncorrectable error condition that renders the hardware unreliable. A non-fatal error includes, but is not limited to, an uncorrectable error that makes a particular task unreliable while the hardware remains fully functional.
The importance-level division unit 30 divides the data in which an error occurred into important data of M importance levels and unimportant data. The division of data into important and unimportant data, and the division of the important data into importance levels, can be performed according to at least one of the following factors: the size of the parameter, the absolute value of the parameter, the type of the parameter (integer, floating point), the read frequency of the parameter, and the write frequency of the parameter.
The importance-level division unit 30 may further include an important-bit division unit 31, which divides the important data of each importance level into important bits and unimportant bits. Specifically, the bits in the data are divided into important bits and unimportant bits. Important bits are extracted from the important data of the i-th importance level: if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi - Yi unimportant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi. The positions of the Yi important bits may be contiguous or non-contiguous.
The error recovery control unit 40 performs error recovery on the erroneous data.
Specifically, the error recovery control unit 40 performs error recovery first on the important bits of the data of each importance level. If recovery fails, then, as a second choice, it performs error recovery on both the important bits and the unimportant bits of the important data of each importance level. If recovery still fails, it performs error recovery on both the important data and the unimportant data among the erroneous data.
The above arrangement is not mandatory. In some embodiments, the importance-level division unit 30 does not include the important-bit division unit 31; the data is only divided into important data and unimportant data, without distinguishing important bits from unimportant bits within the important data. In that case, when performing error recovery, the error recovery control unit 40 first performs error recovery on the important data; if recovery fails, it performs error recovery on both the important data and the unimportant data among the erroneous data.
In the above embodiments, the data include neural network parameters. The neural network parameters include the neurons and weights of the neural network, as well as the topology of the neural network, and may also include instructions. The error recovery device can be used for a neural network processor.
Yet another embodiment of the invention provides a processor, comprising: at least one of a storage unit, an instruction control unit, and an arithmetic unit; and at least one error recovery device 100 as described above.
The processor may be a neural network processor 1000. Fig. 3 shows a structural block diagram of the neural network processor of this embodiment. As shown in Fig. 3, the neural network processor 1000 includes a storage unit 200, an instruction control unit 300, and an arithmetic unit 400.
The storage unit 200 receives external input data; stores the neurons, weights, and/or instructions of the neural network; sends the instructions to the instruction control unit 300; and sends the neurons and weights to the arithmetic unit 400.
The instruction control unit 300 receives the instructions sent by the storage unit 200, decodes them to generate control information, and controls the arithmetic unit 400.
The arithmetic unit 400 receives the weights and neurons sent by the storage unit 200, completes the neural network training operation, and sends the output neurons back to the storage unit 200 for storage.
As shown in Fig. 3, the neural network processor 1000 further includes error recovery devices 100 corresponding respectively to the storage unit 200, the instruction control unit 300, and the arithmetic unit 400. Each error recovery device 100 is embedded in its corresponding unit and performs error recovery for the storage unit 200, the instruction control unit 300, or the arithmetic unit 400, respectively.
In some embodiments, the error recovery devices 100 corresponding to the storage unit 200, the instruction control unit 300, and the arithmetic unit 400 may instead be connected to those units respectively; the embedded arrangement is not required.
In some embodiments, the neural network processor 1000 may include only one error recovery device 100, connected to the storage unit 200, the instruction control unit 300, and the arithmetic unit 400, and performing error recovery for the storage unit 200, the instruction control unit 300, and the arithmetic unit 400.
For the neural network processor 1000 shown in Fig. 3, the data include neural network parameters. The neural network parameters include the neurons and weights of the neural network, as well as the topology of the neural network, and may also include instructions.
When recovering from an error in the storage unit 200, error recovery is preferably performed first on the important bits of the important parameters. If the error cannot be recovered, then, as a second choice, error recovery is performed on the important parameters. If the error still cannot be recovered, error recovery is performed on all parameters.
The important parameters and the important bits among the neural network parameters can be stored redundantly using error correcting codes. Error correcting codes include, but are not limited to, cyclic redundancy check (CRC) and error checking and correcting (ECC).
An ECC check can correct a 1-bit error; an error of more than one bit cannot be recovered.
CRC checks include CRC12, whose error detection capability is as follows: first, if the number of erroneous bits is odd, the error can be detected; second, if the number of erroneous bits is no more than 5, the error can be detected; third, if the length of a single burst error is no more than 12, the error can be detected; fourth, if there are two burst errors each of length no more than 2, the error can be detected. When the error exceeds these four cases, it cannot be recovered.
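For illustration, a bit-serial CRC-12 checksum over the commonly used polynomial x^12 + x^11 + x^3 + x^2 + x + 1 (0x80F) can be computed as below; the polynomial choice is an assumption, since the patent does not name one. The sketch demonstrates detection of a single-bit error:

```python
def crc12(data: bytes) -> int:
    """Bit-serial CRC-12 over x^12 + x^11 + x^3 + x^2 + x + 1 (0x80F),
    processing message bits MSB first through a 12-bit shift register."""
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):
            feedback = ((reg >> 11) & 1) ^ ((byte >> i) & 1)
            reg = (reg << 1) & 0xFFF
            if feedback:
                reg ^= 0x80F
    return reg

msg = b"weights"
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]  # flip one bit
# A CRC whose polynomial has more than one term detects every single-bit error:
print(crc12(msg) != crc12(corrupted))  # True
```

Note that a CRC by itself only detects errors; recovery then proceeds from the redundant copies described below.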
The important parameters and the important bits among the neural network parameters can also be backed up by copying. The redundant copies may be stored in the same storage medium or in different storage media. The data can be backed up in N copies simultaneously, where N is a positive integer greater than 0.
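Recovery from N redundant copies can be sketched as a majority vote; this is one illustrative reading of the copy-backup scheme (the vote is only decisive when a majority of the copies agree, so N is typically chosen odd):

```python
from collections import Counter

def recover_from_copies(copies):
    """Recover a value from N redundant copies by majority vote."""
    value, _count = Counter(copies).most_common(1)[0]
    return value

# One of three copies of a word was corrupted; the vote restores it:
print(recover_from_copies([0b1011, 0b1011, 0b0011]) == 0b1011)  # True
```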
When the error cannot be recovered using the error correcting code, error recovery can be performed using the backup data.
When the error frequency of a physical address exceeds a threshold T, the physical address is set as an invalid physical address, the invalid physical address is released, and the instructions corresponding to the invalid physical address are killed.
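The threshold-based invalidation of faulty physical addresses can be sketched as follows; the class name, interface, and threshold value are illustrative assumptions:

```python
from collections import defaultdict

class AddressMonitor:
    """Track per-address error counts; once the count for a physical
    address exceeds the threshold T, mark it invalid (releasing it and
    killing its pending instructions is left to the caller)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = defaultdict(int)
        self.invalid = set()

    def report_error(self, addr):
        """Return True exactly when the address transitions to invalid."""
        self.errors[addr] += 1
        if self.errors[addr] > self.threshold and addr not in self.invalid:
            self.invalid.add(addr)
            return True
        return False

mon = AddressMonitor(threshold=2)
print([mon.report_error(0x40) for _ in range(4)])
# [False, False, True, False]
```

The one-shot True return gives the caller a single, unambiguous point at which to release the address and kill its instructions.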
When recovering from an error in the instruction control unit, the instruction in which the error occurred is decoded and executed again.
When recovering from an error in the arithmetic unit, preferably the operation results of the unimportant parameters and of the unimportant bits are retained, and the operations on the important bits of the important parameters are re-executed. When the preferred method cannot recover the error, then, as a second choice, the operation results of the unimportant parameters are retained and the operations on the important parameters are re-executed. When the second-choice method still cannot recover the error, the operations on all parameters are re-executed.
Fig. 4 shows a structural block diagram of a processor according to a further embodiment of the invention. In this embodiment, the processor 2000 includes the unit structures of all the above embodiments. It is divided overall into a preprocessing module 2001 and a neural network operation module 2002.
The preprocessing module 2001 preprocesses the original input data, including cropping, Gaussian filtering, binarization, regularization, normalization, etc., and inputs the processed data to the neural network operation module 2002.
The neural network operation module 2002 performs the neural network operation and outputs the final operation result.
The neural network operation module 2002 includes the same storage unit 200, control unit 300, and arithmetic unit 400 as the previous embodiment.
The storage unit 200 is mainly used to store the neurons, weights, and instructions of the neural network. When storing weights, only the non-zero weights and the position information of the non-zero weights are stored; when storing quantized non-zero weights, only the non-zero weight codebook and the non-zero weight dictionary are stored. The control unit 300 receives the instructions sent by the storage unit 200, decodes them to generate control information, and controls the arithmetic unit 400. The arithmetic unit 400 performs the corresponding operations on the data according to the instructions stored in the storage unit 200.
The neural network operation module 2002 further includes a direct memory access (DMA) unit 500, an input cache unit 600, a lookup table 700, a selection unit 800, and an output neuron cache 900.
The DMA 500 is used to read and write data or instructions between the storage unit 200, the input cache unit 600, and the output neuron cache 900.
The input cache unit 600 includes an instruction cache 601, a non-zero weight codebook cache 602, a non-zero weight dictionary cache 603, a non-zero weight position cache 604, and an input neuron cache 605.
The instruction cache 601 is used to store dedicated instructions;
The non-zero weight codebook cache 602 is used to cache the non-zero weight codebook;
The non-zero weight dictionary cache 603 is used to cache the non-zero weight dictionary;
The non-zero weight position cache 604 is used to cache the non-zero weight position data; the non-zero weight position cache maps each connection weight in the input data one-to-one to the corresponding input neuron.
In one case, the one-to-one correspondence method of the non-zero weight position cache uses 1 to indicate a connection and 0 to indicate no connection, and the connection states of each group of outputs with all inputs form a string of 0s and 1s that represents the connection relation of that output. In another case, the method uses 1 to indicate a connection and 0 to indicate no connection, and the connection states of each group of inputs with all outputs form a string of 0s and 1s that represents the connection relation of that input. In yet another case, the method records, for a group of outputs, the distance from the position of the first connection's input neuron to the first input neuron, the distance from the second connection's input neuron to the previous input neuron, the distance from the third connection's input neuron to the previous input neuron, and so on, until all inputs of that output are exhausted, thereby representing the connection relation of the output.
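The third (distance-based) encoding above can be sketched as follows; the function names are illustrative:

```python
def distance_encode(positions):
    """Encode sorted non-zero-weight input positions as: the position of
    the first connection, then each connection's gap to the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps

def distance_decode(gaps):
    """Invert distance_encode by accumulating the gaps."""
    positions, pos = [], 0
    for g in gaps:
        pos += g
        positions.append(pos)
    return positions

# An output neuron connected to input neurons 2, 5, 6, and 9:
print(distance_encode([2, 5, 6, 9]))  # [2, 3, 1, 3]
print(distance_decode([2, 3, 1, 3]))  # [2, 5, 6, 9]
```

Compared with the bitstring encodings of the first two cases, the distance form is compact when connections are sparse, since only the non-zero positions are stored.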
Neuron caching 605 is inputted, for caching the input neuron for being input to coarseness and selecting counting unit.
Look-up table unit 700 is used to parse the weight of the neural network after quantization, receives weight dictionary and weight password This, obtains weight by search operation, directly passes through bypass to arithmetic element for the weight not quantified.
Select counting unit 800 for receive input neuron and non-zero weight location information, select and calculated Neuron.Counting unit is selected, for receiving input neuron and non-zero weight location information, selects the corresponding nerve of non-zero weight Member.That is: for each output nerve metadata, selecting counting unit to get rid of not corresponding with the output nerve metadata The input neuron number evidence of non-zero weight data.
The error recovery device can be embedded in at least one of the preprocessing module 2001, the storage unit 200, the control unit 300, the arithmetic unit 400, the DMA 500, the input cache unit 600, the look-up table unit 700, and the selection unit 800 to perform error recovery there.
The processes or methods depicted in the preceding figures can be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in terms of operations in a certain order, it should be understood that some of the described operations can be executed in a different order. Moreover, some operations may be executed in parallel rather than sequentially.
It should be noted that implementations not shown or described in the drawings or in the text of the specification take forms known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the elements and methods are not limited to the specific structures, shapes, or modes mentioned in the embodiments, which may be simply changed or replaced by those of ordinary skill in the art.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (20)

1. An error recovery method, comprising:
dividing data in which an error has occurred into significant data of M importance levels and insignificant data; and
performing error recovery on the significant data, and performing error recovery on both the significant data and the insignificant data if the significant data cannot be recovered, wherein M is a positive integer.
2. The error recovery method according to claim 1, wherein the significant data of each importance level is divided into significant bits and insignificant bits, and performing error recovery on the significant data comprises: performing error recovery on the significant bits, and performing error recovery on both the significant bits and the insignificant bits of the significant data if the significant bits cannot be recovered.
3. The error recovery method according to claim 1 or 2, wherein dividing the data in which an error has occurred into significant data of M importance levels and insignificant data comprises:
monitoring the working timing of each module in a processor and generating an error signal if an error is found; and
locating, according to the error signal, the module in the processor in which the error occurred, the pipeline position, and the error type.
4. The error recovery method according to claim 1 or 2, wherein dividing the data in which an error has occurred into significant data and insignificant data comprises dividing according to at least one of the size of the data, the magnitude of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
5. The error recovery method according to claim 2, wherein dividing the significant data of each importance level into significant bits and insignificant bits comprises:
extracting the significant bits from the significant data of the i-th importance level, wherein if the significant data has Xi bits and Yi bits are designated as significant bits, the significant data has Xi-Yi insignificant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
6. The error recovery method according to claim 5, wherein the Yi bits comprise consecutive bits or non-consecutive bits.
7. The error recovery method according to claim 1 or 2, wherein the data comprise neural network parameters and the error recovery method is applied to a neural network processor.
8. An error recovery device, comprising:
an importance-level division unit, which divides data in which an error has occurred into significant data of M importance levels and insignificant data; and
an error recovery control unit, which performs error recovery on the significant data, and performs error recovery on both the significant data and the insignificant data if the significant data cannot be recovered, wherein M is a positive integer.
9. The error recovery device according to claim 8, wherein the importance-level division unit further comprises a significant-bit division subunit for dividing the significant data of each importance level into significant bits and insignificant bits, and the error recovery control unit performing error recovery on the significant data comprises performing error recovery on the significant bits, and performing error recovery on both the significant bits and the insignificant bits of the significant data if the significant bits cannot be recovered.
10. The error recovery device according to claim 8 or 9, further comprising:
an error monitoring unit, which monitors the working timing of each module in a processor and generates an error signal if an error is found; and
an error locating unit, which locates, according to the error signal, the module in the processor in which the error occurred, the pipeline position, and the error type.
11. The error recovery device according to claim 8 or 9, wherein the importance-level division unit dividing the data in which an error has occurred into significant data and insignificant data comprises dividing according to at least one of the size of the data, the magnitude of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
12. The error recovery device according to claim 9, wherein the importance-level division unit dividing the significant data of each importance level into significant bits and insignificant bits comprises:
extracting the significant bits from the significant data of the i-th importance level, wherein if the significant data has Xi bits and Yi bits are designated as significant bits, the significant data has Xi-Yi insignificant bits, where i = 1, 2, ..., M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
13. The error recovery device according to claim 12, wherein the Yi bits comprise consecutive bits or non-consecutive bits.
14. The error recovery device according to claim 12, wherein the data comprise neural network parameters and the error recovery device is applied to a neural network processor.
15. A processor, comprising:
at least one of a preprocessing module, a DMA, a storage unit, an input cache unit, an instruction control unit, and an arithmetic unit; and
at least one error recovery device according to any one of claims 8-14.
16. The processor according to claim 15, wherein the error recovery device is connected to at least one of the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit, and performs error recovery for errors of the connected at least one of the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit.
17. The processor according to claim 15, wherein there are a plurality of error recovery devices in one-to-one correspondence with the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit, and each error recovery device performs error recovery for errors in the preprocessing module, DMA, storage unit, input cache unit, instruction control unit, or arithmetic unit to which it is connected.
18. The processor according to claim 15, wherein the error recovery device is embedded in at least one of the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit.
19. The processor according to claim 15, wherein performing error recovery for an error of the arithmetic unit comprises re-executing the operation on the parameters.
20. The processor according to claim 15, wherein the processor comprises a neural network processor; the storage unit stores neurons, weights and/or instructions of a neural network; the instruction control unit receives the instructions, decodes them to generate control information, and controls the arithmetic unit; and the arithmetic unit receives the weights and the neurons, completes a neural network training operation, and transmits output neurons back to the storage unit.
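The prioritized recovery flow of claims 1, 2, and 5 can be sketched as follows. This is an illustrative reading rather than the patented implementation: `try_recover` stands in for whatever underlying recovery mechanism (ECC, re-execution, etc.) is used, and the top Yi bits are taken as the significant bits here, although claim 6 also allows a non-consecutive choice.

```python
def split_bits(value, xi, yi):
    # Split an Xi-bit value into Yi significant (high) bits and
    # Xi - Yi insignificant (low) bits.
    low = xi - yi
    return value >> low, value & ((1 << low) - 1)

def recover(significant_data, insignificant_data, try_recover):
    # Claim 1/2 flow: first try to recover only the significant bits of
    # each significant datum; only on failure fall back to recovering
    # all of its bits together with the insignificant data.
    log = []
    for value, xi, yi in significant_data:
        hi, _lo = split_bits(value, xi, yi)
        log.append(("significant_bits", hi))
        if not try_recover(hi):
            log.append(("all_bits", value))
            try_recover(value)
            log.extend(("insignificant", d) for d in insignificant_data)
    return log
```

When recovery of the significant bits succeeds, the insignificant data are never touched, which is the point of the prioritization.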
CN201710668375.9A 2017-08-07 2017-08-07 Error recovery method and device and processor Active CN109388519B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110625444.4A CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor
CN201710668375.9A CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710668375.9A CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110625444.4A Division CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Publications (2)

Publication Number Publication Date
CN109388519A true CN109388519A (en) 2019-02-26
CN109388519B CN109388519B (en) 2021-06-11

Family

ID=65413441

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110625444.4A Pending CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor
CN201710668375.9A Active CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor


Country Status (1)

Country Link
CN (2) CN113419899A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489852A (en) * 2019-08-14 2019-11-22 北京天泽智云科技有限公司 Improve the method and device of the wind power system quality of data

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101547144A (en) * 2008-12-29 2009-09-30 华为技术有限公司 Method, device and system for improving data transmission quality
US20100037097A1 (en) * 2008-08-07 2010-02-11 Hitachi, Ltd. Virtual computer system, error recovery method in virtual computer system, and virtual computer control program
CN102017498A (en) * 2008-05-06 2011-04-13 阿尔卡特朗讯公司 Recovery of transmission errors
CN106648968A (en) * 2016-10-19 2017-05-10 盛科网络(苏州)有限公司 Data recovery method and device when ECC correction failure occurs on chip
CN107025148A (en) * 2016-10-19 2017-08-08 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of mass data


Non-Patent Citations (1)

Title
ZHENG, Fang et al.: "Research on Lightweight Error Recovery Techniques for Many-Core Processors for High-Performance Computing", Journal of Computer Research and Development *


Also Published As

Publication number Publication date
CN109388519B (en) 2021-06-11
CN113419899A (en) 2021-09-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant