WO2022176006A1 - Communication device, communication failure management method, failure management program, and communication system - Google Patents
- Publication number
- WO2022176006A1 (application PCT/JP2021/005657)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- error
- unit
- communication
- signal
- cram
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B17/00—Monitoring; Testing
- H04B17/20—Monitoring; Testing of receivers
- H04B17/29—Performance testing
Definitions
- The present invention relates to a communication device, a communication failure management method, a failure management program, and a communication system, and in particular to countermeasures against soft errors that cause communication device failures.
- Even when an error correction function is used, soft errors may cause communication equipment failures (Non-Patent Document 2). For example, suppose an error occurs in the CRAM (Configuration Random Access Memory) in the internal circuit of a programmable device such as an FPGA (Field-Programmable Gate Array). Communication information may be processed by the FPGA before the error is corrected and the logic circuit configured in the FPGA is restored to the correct logic structure. The results of that processing reach downstream communication equipment and may affect communication services and the like.
- CRAM Configuration Random Access Memory
- FPGA Field-Programmable Gate Array
- Tateno et al., "Soft error test results report for communication equipment with error correction/detection functions", The Institute of Electronics, Information and Communication Engineers Network Systems Study Group, March 2020.
- Tateno, "Report on Analysis of Failure Influence of CRAM Errors in Communication Devices", 2020 IEICE Communication Society Conference, September 2020.
- An appropriate countermeasure is taken upon detection of a CRAM error, which makes it possible to detect failures early and prevent silent failures, or to shorten the time to restore service or suppress the occurrence of failures.
- FIG. 1 shows configuration example-1 of a communication device according to an embodiment of the present invention.
- The communication device 10 shown in FIG. 1 includes an upstream communication device 11, an FPGA board 12, and a downstream communication device 14 as equipment for processing information transmitted by communication. That is, the information transmitted by the communication device 10 is first input to the upstream communication device 11, the communication signal SG1 output from the upstream communication device 11 is input to the FPGA board 12, and the communication signal SG2 output from the FPGA board 12 is input to the downstream communication device 14.
- The FPGA board 12 implements the functions of the preprocessing unit 13 necessary to generate the downstream communication signal SG2 from the upstream communication signal SG1.
- An FPGA integrated circuit mounted on the circuit board of the FPGA board 12 constitutes the preprocessing unit 13.
- The preprocessing unit 13 has a variable logic circuit unit 12a and a CRAM 12b as main components. The logic circuit configuration of the variable logic circuit unit 12a is determined by the data written in the CRAM 12b, and this in turn determines the processing performed by the preprocessing unit 13.
- The CRAM error detection unit 12c is incorporated inside the integrated circuit or on the FPGA board 12 in order to detect errors in the CRAM 12b, which determines the logic structure inside the programmable device. Since the CRAM 12b is an ECC (Error Correction Code) memory and redundant bits are added to each piece of data, the CRAM error detection unit 12c can detect the occurrence of a soft error based on the redundant bits. In practice, 1-bit errors can be detected and corrected, whereas 2-bit errors can only be detected.
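The "correct 1 bit, detect 2 bits" behaviour can be illustrated with a textbook extended Hamming (8,4) code. This is only a minimal sketch of the general principle: the code word size, the bit layout, and the function names `encode` and `check` are assumptions chosen for illustration, and the actual ECC scheme of the CRAM 12b is not specified in the text.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word.

    Index 0 holds the overall parity bit; indices 1..7 form Hamming(7,4)
    with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def check(code):
    """Return (status, code word), correcting a single flipped bit if present."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all 1 bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        return "ok", code
    if parity_odd:
        # Odd number of flips: treat as a single error and correct it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed = code.copy()
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Non-zero syndrome with even parity: two bits flipped, not correctable.
    return "uncorrectable", code


word = encode(0b1011)
one_flip = word.copy()
one_flip[5] ^= 1
two_flips = word.copy()
two_flips[2] ^= 1
two_flips[6] ^= 1
print(check(one_flip)[0], check(two_flips)[0])
```

A single flipped bit is located by the syndrome and repaired, while two flipped bits leave the overall parity even and can only be reported.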
- ECC Error Correction Code
- For a 1-bit error, the CRAM error detection unit 12c automatically corrects the error using ECC, so a problem in the FPGA itself can be avoided. Therefore, normally no alarm is output for a 1-bit soft error. For a 2-bit error, an alarm is generated, so the influence of the error can be prevented from spreading to downstream communication devices by, for example, restarting the system.
- However, because the CRAM error detection unit 12c detects errors by sequentially scanning the entire storage area of the CRAM 12b, it takes a certain amount of time from the occurrence of an error until it is corrected.
- As a result, a temporarily abnormal communication signal SG2 may be output from the FPGA board 12 and input to the downstream communication device 14.
- This abnormal communication signal SG2 may cause a failure in the downstream communication device 14. Because no alarm is output, this can become a silent failure whose cause is unknown, which increases the damage.
- The communication device 10 shown in FIG. 1 includes a failure detection unit 15 and a failure management unit 16 as countermeasures against a 1-bit soft error in the CRAM 12b.
- The failure detection unit 15 and the failure management unit 16 operate as a downstream failure processing unit that, in response to the occurrence of an upstream error detected by the CRAM error detection unit 12c, performs at least one of investigation, suppression, and recovery of failures that occur in the downstream communication device 14 due to this upstream error. Further, the CRAM error detection unit 12c is configured to transmit an error notification ER1 when a 1-bit soft error occurs.
- Upon receiving the error notification ER1 from the CRAM error detection unit 12c, the failure management unit 16 transmits a sensitivity change command CM1 to the failure detection unit 15.
- The failure detection unit 15 temporarily raises the sensitivity of failure detection in the downstream communication device 14 above its normal level in accordance with the sensitivity change command CM1 from the failure management unit 16.
- That is, the failure management unit 16 operates as a sensitivity adjustment instruction unit that, in response to the occurrence of an upstream error, at least temporarily raises the failure detection sensitivity of the failure detection unit 15 above its normal level.
- Increasing the sensitivity of failure detection in the downstream communication device 14 shortens the detection delay time, that is, the time from the occurrence of some abnormality until the failure detection unit 15 actually detects the resulting failure.
- For example, assume that the failure detection unit 15 detects a failure when an abnormal state it is monitoring persists for a protection time. In this case, shortening the protection time shortens the detection delay time and increases the detection sensitivity to failures.
- As another example, assume that the failure detection unit 15 detects a failure when an abnormal state it is monitoring is detected a threshold number of times in succession. In this case, changing the threshold for the number of consecutive detections regarded as abnormal from "4" in the normal state to "3" shortens the detection delay time and increases the detection sensitivity to failures.
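The effect of lowering the consecutive-detection threshold from 4 to 3 can be sketched as follows. The class `ConsecutiveFailureDetector` and the helper `samples_until_detection` are hypothetical names introduced for this illustration, and the boolean sample sequences are made-up inputs, not measured data.

```python
class ConsecutiveFailureDetector:
    """Reports a failure once `threshold` abnormal samples occur in a row."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self._run = 0  # length of the current run of abnormal samples

    def observe(self, abnormal):
        """Feed one monitoring sample; return True when a failure is declared."""
        self._run = self._run + 1 if abnormal else 0
        return self._run >= self.threshold


def samples_until_detection(threshold, samples):
    """Return the 1-based index of the sample at which a failure is declared."""
    detector = ConsecutiveFailureDetector(threshold)
    for index, abnormal in enumerate(samples, start=1):
        if detector.observe(abnormal):
            return index
    return None  # no failure declared for this sequence


persistent = [False, True, True, True, True]
transient = [True, True, True, False]
print(samples_until_detection(4, persistent), samples_until_detection(3, persistent))
print(samples_until_detection(4, transient), samples_until_detection(3, transient))
```

With the normal threshold the persistent abnormality is declared one sample later, and the three-sample transient is not declared at all, which is the silent case the raised sensitivity is meant to catch.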
- When the failure detection unit 15 shown in FIG. 1 detects a failure in the downstream communication device 14, it reports the failure to the system that manages failures (EMS: Element Management System), the operator of the monitoring center, and the like. Recovery measures such as restarting the device affected by the failure, for example the downstream communication device 14, can then be carried out at an early stage, either on the instruction of a remote operator or by the autonomous judgment of the EMS or a similar system.
- EMS Element Management System
- A device failure caused by a soft error is temporary, so the normal state can be restored by restarting the device. The above description assumes that a failure caused by a soft error would be detected sooner or later anyway, and that the failure detection unit 15 detects it earlier. On the other hand, depending on the failure mode and the detection method, raising the failure detection sensitivity may make it possible to discover the cause of the failure itself and repair it automatically. In this case, there is the advantage of preventing silent failures.
- FIG. 2 shows an example of main operations of the communication device 10 of FIG. 1.
- When a 1-bit soft error occurs in the CRAM 12b, the CRAM error detection unit 12c detects it and sends an error notification ER1, as shown in FIG. 2.
- The failure management unit 16 transmits a sensitivity change command CM1 in response to the received error notification ER1.
- The failure detection unit 15 raises the failure detection sensitivity for the downstream communication device 14 in accordance with the received sensitivity change command CM1.
- Threshold parameters such as the protection time used by the failure detection unit 15 to detect a failure are often set in advance to values that are optimal for the environment of the actual communication device. Therefore, after the CRAM error detection unit 12c has detected and corrected a 1-bit soft error in the CRAM 12b and the variable logic circuit unit 12a has returned to the correct logic structure, failure detection sensitivity parameters such as the protection time should be returned to their normal settings.
- For this reason, the failure detection unit 15 or the failure management unit 16 automatically restores the failure detection sensitivity to its normal level when a predetermined control period T1 has elapsed after the sensitivity was raised.
- The length of the control period T1 is, for example, about several tens of seconds. Several tens of seconds after the CRAM error detection unit 12c detects a 1-bit soft error in the CRAM 12b, the erroneous bit data in the CRAM 12b has been corrected and the variable logic circuit unit 12a has been restored to the correct logic structure, so the failure detection sensitivity can be returned to the normal state.
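The raise-then-restore behaviour over the control period T1 can be sketched as a small state holder. The class name `SensitivityController`, the 30-second value for T1, and the use of the consecutive-detection threshold (4 normally, 3 when raised) as the sensitivity parameter are all assumptions for illustration; time is passed in explicitly so the sketch stays deterministic.

```python
class SensitivityController:
    """Tracks whether failure detection sensitivity is currently raised."""

    def __init__(self, normal_threshold=4, raised_threshold=3, control_period_s=30.0):
        self.normal_threshold = normal_threshold
        self.raised_threshold = raised_threshold
        self.control_period_s = control_period_s  # T1, assumed to be 30 s here
        self._raised_at = None  # time at which ER1 was last received

    def on_error_notification(self, now):
        """Handle ER1: raise the sensitivity (the role of command CM1)."""
        self._raised_at = now

    def current_threshold(self, now):
        """Return the threshold in force at time `now` (seconds)."""
        if self._raised_at is not None and now - self._raised_at < self.control_period_s:
            return self.raised_threshold
        self._raised_at = None  # T1 has elapsed: back to the normal setting
        return self.normal_threshold


controller = SensitivityController()
controller.on_error_notification(now=10.0)
print(controller.current_threshold(now=20.0), controller.current_threshold(now=45.0))
```

Restoring the normal value automatically avoids leaving the detector in a state that is more prone to false alarms than the tuned setting.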
- When the CRAM error detection unit 12c detects the error, the failure is reported to the EMS that manages failures, the operator of the monitoring center, or the like.
- FIG. 3 shows a communication device 10B of configuration example-2 according to the embodiment of the present invention. FIG. 4 shows an example of main operations in the communication device 10B of FIG. 3.
- The communication device 10B in FIG. 3 is a modified example of the communication device 10 in FIG. 1, and the same components in FIGS. 1 and 3 are indicated by the same reference numerals.
- The communication device 10B in FIG. 3 includes an upstream communication device 11, an FPGA board 12, and a downstream communication device 14, like the communication device 10 in FIG. 1.
- The downstream communication device 14 shown in FIG. 3 is composed of a downstream communication device body 14a and a signal holding unit 14b.
- The signal holding unit 14b has a function of temporarily holding the output of the processing result of the downstream communication device body 14a.
- The communication device 10B shown in FIG. 3 includes a failure detection unit 15 and a failure management unit 16B as countermeasures against 1-bit soft errors in the CRAM 12b.
- The CRAM error detection unit 12c is configured to transmit an error notification ER1 when a 1-bit soft error occurs.
- Upon receiving the error notification ER1 from the CRAM error detection unit 12c, the failure management unit 16B in FIG. 3 sends a signal discard command CM21 to the downstream communication device 14 and a retransmission request CM22 to the upstream communication device 11.
- That is, the failure management unit 16B operates as a discard instruction unit that, in response to the occurrence of an upstream error, causes the corresponding signal to be discarded in the downstream communication device body 14a or the signal holding unit 14b of the downstream communication device 14.
- The failure management unit 16B issues the signal discard command CM21 so that the possibly erroneous signal is discarded. Then, after the error in the CRAM 12b has been corrected and the variable logic circuit unit 12a has been restored to the correct logic structure, the upstream communication device 11 transmits the second communication signal SG1a-2 in accordance with the retransmission request CM22, as shown in FIG. 4. The contents of the second communication signal SG1a-2 are the same as those of the first communication signal SG1a-1.
- That is, the failure management unit 16B also operates as a retransmission request unit that instructs the upstream communication device 11, located upstream of the preprocessing unit 13, to retransmit the corresponding signal.
- The upstream communication device 11 transmits the second communication signal SG1a-2 as a retransmission only after the error in the CRAM 12b has been corrected and the correct information in the CRAM 12b has been reflected in the logic structure of the variable logic circuit unit 12a. Correcting an error in the CRAM 12b generally takes several tens of milliseconds, so a time interval longer than this must be provided between the two communication signals SG1a-1 and SG1a-2.
- The signal holding unit 14b temporarily holds the signal processed by the downstream communication device body 14a. Then, once it has been determined that the signal discard command CM21 will not arrive, the signal held by the signal holding unit 14b is output to the downstream side.
- If the signal discard command CM21 does not arrive, the signal holding unit 14b outputs to the downstream side the result of processing, in the downstream communication device body 14a, the communication signal SG2 corresponding to the first communication signal SG1a-1. If the signal discard command CM21 arrives, that result is discarded by the downstream communication device body 14a or the signal holding unit 14b.
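The hold-then-release-or-discard behaviour of the signal holding unit 14b can be sketched as below. The class `SignalHoldingUnit`, its method names, and the 50 ms hold window are assumptions made for illustration; the text does not specify how the unit decides that the discard command will no longer arrive, so a fixed window is used here.

```python
class SignalHoldingUnit:
    """Holds processed signals until a discard command can no longer arrive."""

    def __init__(self, hold_ms):
        self.hold_ms = hold_ms  # assumed window in which CM21 may still arrive
        self._held = []         # list of (arrival time in ms, signal)

    def hold(self, signal, now_ms):
        """Put a processed signal on hold instead of forwarding it at once."""
        self._held.append((now_ms, signal))

    def on_discard_command(self):
        """Handle CM21: drop everything currently on hold and report it."""
        dropped = [signal for _, signal in self._held]
        self._held.clear()
        return dropped

    def release_due(self, now_ms):
        """Return the signals whose hold window has passed without a discard."""
        due = [s for t, s in self._held if now_ms - t >= self.hold_ms]
        self._held = [(t, s) for t, s in self._held if now_ms - t < self.hold_ms]
        return due


unit = SignalHoldingUnit(hold_ms=50)
unit.hold("result of SG1a-1", now_ms=0)
print(unit.on_discard_command())     # CM21 arrived in time: result is dropped
unit.hold("result of SG1a-2", now_ms=100)
print(unit.release_due(now_ms=150))  # no CM21: retransmitted result goes out
```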
- If the downstream communication device 14 is located at a position where the signal discard command CM21 always arrives earlier than the erroneous communication signal SG2, the function of the signal holding unit 14b is unnecessary.
- In that case, the signal processed by the downstream communication device body 14a can be output directly to the downstream side.
- For this, the signal discard command CM21 must arrive before the downstream communication device 14 processes the communication signal SG2 corresponding to the first communication signal SG1a-1 and outputs the result. For example, if the communication signal SG1 reaches the input of the downstream communication device 14 as the communication signal SG2 60 msec after being processed by the preprocessing unit 13, the system is designed so that the signal discard command CM21 is sent to the downstream communication device 14 within 50 msec. Note that the time required for the CRAM error detection unit 12c to detect a soft error in the CRAM 12b is generally about several tens of milliseconds.
- As a specific method for the upstream communication device 11 to retransmit the same communication signal SG1, the transmitted signal can be stored in a queue inside the upstream communication device 11. In that case, the signal may be discarded after a predetermined time has elapsed from the time of transmission, after which the next signal in the queue is transmitted. Alternatively, the signal may be discarded and the next signal in the queue transmitted when a discard instruction is received from the failure management unit 16B.
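One way to read the queue-based retransmission is that the upstream device keeps each transmitted signal for a retention time so that it can still be resent on request. The sketch below follows that reading only; the class `UpstreamSender`, its method names, and the 100 ms retention time are hypothetical, and the discard-on-instruction variant is not modelled.

```python
from collections import deque


class UpstreamSender:
    """Keeps recently sent signals so they can be retransmitted on request."""

    def __init__(self, retention_ms):
        self.retention_ms = retention_ms  # assumed time a sent signal is kept
        self._sent = deque()              # (sent time in ms, signal), oldest first

    def send(self, signal, now_ms):
        """Transmit a signal and remember it for possible retransmission."""
        self._sent.append((now_ms, signal))
        return signal

    def expire(self, now_ms):
        """Discard stored signals that were sent at least retention_ms ago."""
        while self._sent and now_ms - self._sent[0][0] >= self.retention_ms:
            self._sent.popleft()

    def on_retransmission_request(self, now_ms):
        """Handle CM22: return the stored signals that are still retained."""
        self.expire(now_ms)
        return [signal for _, signal in self._sent]


sender = UpstreamSender(retention_ms=100)
sender.send("SG1a", now_ms=0)
sender.send("SG1b", now_ms=80)
print(sender.on_retransmission_request(now_ms=90))
print(sender.on_retransmission_request(now_ms=120))
```

The retention time would need to exceed the CRAM error detection and correction time described above, otherwise the signal to be resent could already have been dropped.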
- FIG. 5 shows a communication device 10C of configuration example-3 according to the embodiment of the present invention.
- FIGS. 6 and 7 show examples of the operation timings and the operation procedure, respectively, of the communication device 10C of FIG. 5.
- The communication device 10C shown in FIG. 5 is a modified example of the communication device 10B shown in FIG. 3, and the same components in FIGS. 3 and 5 are indicated by the same reference numerals.
- The communication device 10C of FIG. 5 includes an upstream communication device 11C, an FPGA board 12, and a downstream communication device 14, similar to the communication device 10B of FIG. 3. The upstream communication device 11C in FIG. 5 additionally has a function of transmitting a known test signal.
- The downstream communication device 14 shown in FIG. 5 is composed of a downstream communication device body 14a and a signal holding unit 14b.
- The signal holding unit 14b has a function of temporarily holding the output of the processing result of the downstream communication device body 14a.
- The communication device 10C shown in FIG. 5 includes a failure detection unit 15, a failure management unit 16C, and a test signal diagnosis unit 17 as countermeasures against a 1-bit soft error in the CRAM 12b.
- The CRAM error detection unit 12c is configured to transmit an error notification ER1 when a 1-bit soft error occurs.
- Upon receiving the error notification ER1 from the CRAM error detection unit 12c, the failure management unit 16C in FIG. 5 sends a signal discard command CM31 to the downstream communication device 14, and a test transmission request CM32 and a retransmission request CM33 to the upstream communication device 11C.
- That is, the failure management unit 16C operates as a discard instruction unit that, in response to the occurrence of an upstream error, causes the corresponding signal to be discarded in the downstream communication device 14.
- The failure management unit 16C instructs the discarding with the signal discard command CM31. Then, in order to diagnose whether or not a problem actually occurs, the failure management unit 16C transmits a test transmission request CM32 to the upstream communication device 11C. It does so, for example, after a predetermined time has elapsed during which the error in the CRAM 12b is estimated to have been corrected and the variable logic circuit unit 12a restored to the correct logic structure, or before that predetermined time has elapsed.
- That is, the failure management unit 16C operates as a test signal request unit that, in response to the occurrence of an upstream error, instructs the upstream communication device 11C, located upstream of the preprocessing unit 13, to transmit a known test signal.
- After confirming from the diagnosis result notification NO3 output by the test signal diagnosis unit 17 that there is no error in the logic structure of the variable logic circuit unit 12a, the failure management unit 16C sends a retransmission request CM33 to the upstream communication device 11C.
- That is, the failure management unit 16C operates as a retransmission request unit that instructs the upstream communication device 11C to retransmit the discarded signal after the downstream communication device 14 has obtained the correct processing result for the test signal SG1x.
- The upstream communication device 11C transmits the second communication signal SG1a-2 as a retransmission of the original signal in accordance with the retransmission request CM33, as shown in the drawings.
- The contents of the second communication signal SG1a-2 are the same as those of the first communication signal SG1a-1.
- The signal holding unit 14b temporarily holds the signal processed by the downstream communication device body 14a. Then, once it has been determined that the signal discard command CM31 will not arrive, the signal held by the signal holding unit 14b is output to the downstream side.
- If the downstream communication device 14 is located at a position where the signal discard command CM31 always arrives earlier than the erroneous communication signal SG2, the function of the signal holding unit 14b is unnecessary.
- In that case, the signal processed by the downstream communication device body 14a can be output directly to the downstream side.
- The information of the communication signal SG2 corresponding to the first communication signal SG1a-1 is discarded in the downstream communication device body 14a or in the signal holding unit 14b in accordance with the signal discard command CM31.
- The test transmission request CM32 causes the test signal SG1x to appear as the communication signal SG1.
- When the test signal diagnosis unit 17 recognizes that the information appearing in the communication signal SG2 as a result of the test signal SG1x matches the known expected information for the communication signal SG2, it outputs a diagnosis result notification NO3. In response to this diagnosis result notification NO3, the failure management unit 16C sends a retransmission request CM33, and the upstream communication device 11C sends the second communication signal SG1a-2 as a retransmission.
- When a soft error occurs in the CRAM 12b, the error is detected by the CRAM error detection unit 12c in step S01, so the process proceeds from step S01 to steps S03 and S04. That is, in step S03 the downstream communication device body 14a discards the erroneous information of the communication signal SG2 in accordance with the signal discard command CM31, and in step S04 the upstream communication device 11C transmits known data as the test signal SG1x in accordance with the test transmission request CM32.
- The test signal diagnosis unit 17 carries out a diagnosis by comparing the information of the communication signal SG2 corresponding to the test signal SG1x with the known information. This diagnosis makes it possible to identify whether or not there is an error in the logic structure of the variable logic circuit unit 12a. If the diagnosis detects an error in the logic structure, the process returns from step S05 to step S04 to resend the test signal SG1x. If the diagnosis result does not become OK even after the test signal SG1x has been resent several times, the failure management unit 16C or the failure detection unit 15 executes processing for restarting the affected part of the device.
- When the test signal diagnosis unit 17 recognizes in step S05 that there is no error in the logic structure of the variable logic circuit unit 12a, the process proceeds to the next step S06. That is, the failure management unit 16C transmits a retransmission request CM33 based on the diagnosis result notification NO3, and the upstream communication device 11C transmits the communication signal SG1a-2, identical to the original communication signal SG1a-1, as a retransmission in accordance with the retransmission request CM33.
- In this configuration, the original communication signal SG1 is retransmitted from the upstream communication device 11C only after the test signal diagnosis unit 17 has confirmed that the error correction has been completed. Therefore, more reliable communication control is considered to be realized than with the communication device 10B of FIG. 3.
- If the diagnosis result of the test signal diagnosis unit 17 does not become OK even after the test signal SG1x has been sent several times, a failure may have occurred that cannot be repaired by error correction of the CRAM 12b alone. In that case, the device is restarted to attempt recovery.
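The discard, test, diagnose, and retransmit sequence of steps S03 to S06 can be sketched as a short control loop. The function name `handle_error_notification`, the callable `process_test_signal` standing in for the path through the preprocessing unit 13, and the limit of three test attempts are assumptions for illustration; the text only says "several times".

```python
def handle_error_notification(process_test_signal, known_output, max_tests=3):
    """Return the ordered list of actions taken after ER1 is received.

    process_test_signal: callable that sends the test signal SG1x through the
        (possibly still faulty) logic and returns what appears as SG2.
    known_output: the SG2 content expected for the known test signal.
    """
    actions = ["CM31: discard SG2"]                    # step S03
    for _ in range(max_tests):
        actions.append("CM32: send test signal SG1x")  # step S04
        if process_test_signal() == known_output:      # step S05: diagnosis OK
            actions.append("CM33: retransmit SG1a-2")  # step S06
            return actions
    # Diagnosis never became OK: CRAM correction alone did not repair the fault.
    actions.append("restart device")
    return actions


# Made-up example: the first test still sees garbled output, the second is correct.
outputs = iter(["garbled", "expected"])
for action in handle_error_notification(lambda: next(outputs), "expected"):
    print(action)
```

Gating the retransmission on a passed test means the original signal is never resent into logic that is still misconfigured, at the cost of one extra round trip.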
- FIG. 8 shows a communication device 10D of configuration example-4 according to the embodiment of the present invention.
- The communication device 10D shown in FIG. 8 is a modification of the communication device 10 shown in FIG. 1, and the same components in FIGS. 1 and 8 are indicated by the same reference numerals.
- This communication device 10D includes an upstream communication device 11, two FPGA boards 12-1 and 12-2, a signal selection unit 18, a signal holding unit 19, a downstream communication device 14, a CRAM error detection unit 12c, and a failure management unit 16D. Note that the functions of the two FPGA boards 12-1 and 12-2 may, for example, be arranged together on one circuit board.
- The two FPGA boards 12-1 and 12-2 shown in FIG. 8 each have the function of the preprocessing unit 13, and the two preprocessing units 13 are connected in parallel so that they process the communication signal SG1 through independent communication paths. The FPGA boards 12-1 and 12-2 correspond to a first programmable device circuit and a second programmable device circuit connected in parallel with respect to the signal path.
- Each FPGA integrated circuit is equipped with a CRAM 12b and a function of detecting and correcting its soft errors, in addition to the variable logic circuit unit 12a.
- The CRAM error detection unit 12c can individually detect and correct CRAM soft errors in each of the two FPGA boards 12-1 and 12-2.
- When an error occurs, the CRAM error detection unit 12c identifies the FPGA board 12-1 or 12-2 in which the error occurred and sends an error notification ER1 to the failure management unit 16D.
- The CRAM error detection unit 12c shown in FIG. 8 has the function of detecting and correcting CRAM soft errors in each of the two FPGA boards 12-1 and 12-2. In some cases, the function of transmitting the error notification ER1 in response to an error in whichever of the FPGA boards 12-1 and 12-2 is the standby system can be omitted.
- The communication signal SG1 sent by the upstream communication device 11 is branched into two systems and processed by the preprocessing units 13 of the two FPGA boards 12-1 and 12-2. The communication signals SG21 and SG22 output by the two preprocessing units 13 are then input to the signal selection unit 18 at the same time.
- The failure management unit 16D outputs a selection control command CM4 according to the state of the error notification ER1 sent by the CRAM error detection unit 12c.
- The signal selection unit 18 selects one of the two systems of communication signals SG21 and SG22 according to the selection control command CM4 input from the failure management unit 16D and outputs it as the communication signal SG2. In accordance with the selection control command CM4, the signal selection unit 18 also discards whichever of the communication signals SG21 and SG22 is not selected.
- The communication signal SG2 output by the signal selection unit 18 is input to the downstream communication device 14 via the signal holding unit 19.
- The signal holding unit 19 holds back the output of the communication signal SG2 to the downstream side until it has been determined which of the two communication signals SG21 and SG22 is correct. After it is determined that the communication signal SG2 selected by the signal selection unit 18 is a correct signal, the signal holding unit 19 outputs the held communication signal SG2 to the downstream side.
- If the signal selection unit 18 is arranged at a position where the state of the selection control command CM4 is determined before an erroneous communication signal appears in either of the communication signals SG21 and SG22, the function of the signal holding unit 19 is not necessary, because the erroneous communication signal can be reliably discarded inside the signal selection unit 18.
- The failure management unit 16D operates as a selection instruction unit that, in response to the occurrence of an upstream error, instructs the signal selection unit 18 to select whichever of the signal output by the FPGA board 12-1 and the signal output by the FPGA board 12-2 is not related to the error.
- the operation example shown in FIG. 9 represents an operation for the signal selection unit 18 of the communication device 10D to select one of the two systems of communication signals SG21 and SG22.
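- When no soft error has occurred in the CRAM 12b, the preprocessing units 13 of the two FPGA boards 12-1 and 12-2 perform the same processing, so it does not matter which of the two communication signals SG21 and SG22 the signal selector 18 selects.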
- the signal selector 18 selects one communication signal SG21 in the initial state.
- the CRAM error detection unit 12c identifies whether there is an error in the CRAM 12b in the system of the FPGA board 12-2. When an error in the CRAM 12b is detected in step S23, the process proceeds to the next step S24. In step S24, the state is switched so that the signal selector 18 selects the communication signal SG21 output by the non-selected FPGA1, that is, the FPGA board 12-1, and discards the communication signal SG22 that had been selected until then.
- the signal selection unit 18 selects the system in which no CRAM error is detected from among the two systems of communication signals SG21 and SG22, and switches to the other system each time a CRAM error is detected.
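The alternating behaviour described above can be sketched as a small state machine. This is only an illustrative model, not the patent's implementation: the class name `AlternatingSelector` and the method `on_cram_error` are hypothetical, and the sketch assumes that only an error on the currently selected system triggers a switch.

```python
class AlternatingSelector:
    """Toy model of the signal selector 18 (hypothetical names)."""

    # Which FPGA board produces which communication signal.
    SOURCE = {"SG21": "12-1", "SG22": "12-2"}

    def __init__(self):
        self.selected = "SG21"  # initial state: output of FPGA board 12-1

    def on_cram_error(self, board):
        """Switch systems only if `board` feeds the currently selected signal."""
        if self.SOURCE[self.selected] == board:
            self.selected = "SG22" if self.selected == "SG21" else "SG21"
        return self.selected


selector = AlternatingSelector()
first = selector.on_cram_error("12-1")   # error on the selected system
second = selector.on_cram_error("12-1")  # error on the now unselected system
third = selector.on_cram_error("12-2")   # error on the selected system again
```

The signal that is not selected is simply never forwarded, which corresponds to discarding it.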
- the CRAM error detection unit 12c identifies whether there is an error in the CRAM 12b in the currently selected system of the two FPGA boards 12-1 and 12-2.
- When an error in the CRAM 12b is detected, the process proceeds to the next step S26.
- In step S26, the signal selector 18 switches states so that it selects the other communication signal SG22 output by the non-selected FPGA2, that is, the FPGA board 12-2, and discards the communication signal SG21 that had been selected until then.
- In step S27, the signal selection unit 18 selects the communication signal SG21 output from the FPGA1, that is, the FPGA board 12-1, and returns to the same state as the initial state so that the communication signal SG22 that had been selected up to that point is discarded.
- In the next step S28, the communication signal SG21 is in the selected state as in the initial state, so the CRAM error detector 12c identifies whether there is an error in the CRAM 12b in the system of the FPGA board 12-1. When the communication device 10D detects an error in the CRAM 12b in step S28, the process proceeds to the next step S29.
- In step S29, the signal selection unit 18 selects the communication signal SG22 output by the non-selected FPGA2, that is, the FPGA board 12-2, and switches the state so that the communication signal SG21 that had been selected until then is discarded.
- the signal selection unit 18 preferentially selects one communication signal SG21 from the two systems of communication signals SG21 and SG22, and selects the communication signal SG22 of the system in which no error has occurred only temporarily, when a CRAM error is detected.
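A minimal sketch of this preferred-system behaviour follows. The function name `select_signal` and the boolean flag are hypothetical, and the sketch assumes that the selector returns to SG21 as soon as the CRAM error on the FPGA board 12-1 is no longer active; the excerpt does not state the exact return condition.

```python
def select_signal(primary_error_active):
    """Return the signal to forward: SG21 normally, SG22 while board 12-1 has an error."""
    return "SG22" if primary_error_active else "SG21"


# Hypothetical timeline: no error, error active for two checks, error cleared.
timeline = [False, True, True, False]
choices = [select_signal(flag) for flag in timeline]
```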
- the operation example shown in FIG. 11 represents an operation for the signal selection unit 18 of the communication device 10D to select one of the two systems of communication signals SG21 and SG22.
- control is performed with a focus on matching/mismatching of the communication signals SG21 and SG22, which are the processing results of the preprocessing unit 13 in the two systems of FPGA boards 12-1 and 12-2.
- the signal selector 18 must select the one of the two communication signals SG21 and SG22 that is not affected by the error, and discard the information on the one affected by the error.
- If the comparison results in step S31 match, no CRAM error has occurred, so the communication device 10D proceeds to step S32. In that case, it does not matter which of the two communication signals SG21 and SG22 is selected, but in the example of FIG. 11 one of them is selected with priority.
- If the comparison results in step S31 do not match, a CRAM error may have occurred in one of the two systems, FPGA1 or FPGA2. Therefore, in the next step S33, the signal selector 18 temporarily suspends processing until the notification of the CRAM error arrives. The communication device 10D then proceeds to the next step S34.
- In step S34, the failure management unit 16D identifies whether or not the CRAM error detection unit 12c has detected a CRAM error in one of the FPGA boards, the FPGA board 12-1.
- A state in which the CRAM error detector 12c does not detect a CRAM error in the FPGA board 12-1 in step S34 means that a CRAM error has occurred in the other FPGA board 12-2. Therefore, in step S35, the failure management unit 16D outputs the selection control command CM4 and controls the signal selection unit 18 so that it selects the communication signal SG21, which is the processing result of the FPGA board 12-1 in which no CRAM error has occurred. In this case, the information of the communication signal SG22 is discarded by the signal selector 18.
- If a CRAM error in the FPGA board 12-1 is detected in step S34, the communication device 10D proceeds to step S36.
- In step S36, the failure management unit 16D outputs the selection control command CM4 and controls the signal selection unit 18 to select the communication signal SG22, which is the processing result of the FPGA board 12-2 in which no CRAM error has occurred.
- the information of the communication signal SG21 that is not selected is discarded by the signal selector 18.
- the signal selector 18 can send one of the communication signals SG21 and SG22 to the downstream side communication device 14 at an early stage, so an improvement in processing speed is expected.
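The comparison-based flow of FIG. 11 can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `choose_output` is a hypothetical name, and `error_on_board_1` is a hypothetical zero-argument callable that stands in for waiting on the CRAM error notification and reporting whether the FPGA board 12-1 is the affected one.

```python
def choose_output(sg21, sg22, error_on_board_1):
    """Pick the communication signal to forward downstream.

    error_on_board_1 is only called when the two results differ, so matching
    results can be forwarded without waiting for any notification.
    """
    if sg21 == sg22:        # step S31: compare the two processing results
        return sg21         # step S32: either one works, they are identical
    if error_on_board_1():  # steps S33/S34: wait, then check board 12-1
        return sg22         # step S36: board 12-1 is faulty, use board 12-2
    return sg21             # step S35: board 12-2 is faulty, use board 12-1
```

Because the notification is consulted only on a mismatch, the common error-free case incurs no waiting, which matches the processing-speed benefit noted above.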
- a general FPGA device is equipped with the function of the CRAM error detection unit 12c excluding the function of transmitting the error notification ER1, the variable logic circuit unit 12a, and the CRAM 12b.
- the two FPGA boards 12-1 and 12-2 shown in FIG. 8 are structurally symmetrical, and even if the left and right "FPGA1" and "FPGA2" are interchanged, the same result is obtained. Also, in the operation shown in FIG. 11, "FPGA1" and "FPGA2" can be interchanged.
- the preprocessing unit 13 can be duplicated by the FPGA boards 12-1 and 12-2, so that the reliability of the communication system can be improved while suppressing complication of the structure.
- FIG. 12 shows a communication device 10E of configuration example-5 according to the embodiment of the present invention.
- The communication device 10E shown in FIG. 12 is a modification of the communication device 10C shown in FIG. 5, and the same components are indicated by the same reference numerals in FIGS. 5 and 12.
- the communication device 10E of FIG. 12 includes an upstream communication device 11C, an FPGA board 12, and a downstream communication device 14. Also, the upstream communication device 11C in FIG. 12 has a function of transmitting a known test signal.
- the communication device 10E shown in FIG. 12 includes a failure detection unit 15, a failure management unit 16E, a test signal diagnosis unit 17, and a recording device 20 as countermeasures against 1-bit soft errors in the CRAM 12b.
- the CRAM error detector 12c is configured to transmit an error notification ER1 when a 1-bit soft error occurs.
- Each of the communication devices 10, 10B, 10C, and 10D described above has a function corresponding to a situation in which a soft error occurring in the CRAM 12b leads to an error occurring in the communication signal SG2 on the downstream side.
- Even if a CRAM error occurs, it may not affect the device.
- If the logic structure of an unused area in the variable logic circuit section 12a is changed due to a CRAM error, the communication signal SG2 is not affected, so the downstream side communication device 14 does not malfunction.
- Therefore, even if a CRAM error occurs, it need not necessarily be treated as a failure of the device.
- the failure management unit 16E gives the test transmission request CM32 to the upstream communication device 11C for investigation.
- the upstream communication device 11C transmits the aforementioned test signal SG1x composed of known information as the communication signal SG1 in accordance with the test transmission request CM32.
- This test signal SG1x is processed by the preprocessing section 13 of the FPGA board 12 and input to the downstream side communication device 14 as the communication signal SG2.
- For the test signal SG1x, the correct information of the communication signal SG2 input to the downstream communication device 14 and the correct information appearing in each internal part of the downstream communication device 14 are known in advance. The test signal diagnosis unit 17 can therefore diagnose and investigate whether a failure has actually occurred and how far it has spread. The result of the diagnosis is output from the test signal diagnosis unit 17 as a diagnosis result notification NO3.
- When receiving the diagnosis result notification NO3, the failure management unit 16E outputs the diagnosis result to the recording device 20 for recording and, when a problematic diagnosis result is input, transmits the restart command CM5 to devices in each part of the communication system, such as the downstream communication device 14. In other words, by outputting the restart command CM5, the entire communication system can be restored to a normal state even for failures that cannot be repaired by error correction of the CRAM error alone.
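The diagnose, record, and restart sequence can be sketched as below. The function names, the checkpoint dictionaries, and the in-memory `log` list (standing in for the recording device 20) are all hypothetical; the source only states that known-correct values are compared and that CM5 is issued when the result is problematic.

```python
def diagnose(expected, observed):
    """Return the names of checkpoints whose observed value differs from the known value."""
    return [name for name, value in expected.items() if observed.get(name) != value]


def handle_diagnosis(expected, observed, log):
    """Record the diagnosis and return "CM5" (restart) only if something is wrong."""
    faulty = diagnose(expected, observed)
    log.append({"faulty": faulty})  # stand-in for writing to the recording device 20
    return "CM5" if faulty else None
```

Returning the list of mismatching checkpoints, rather than a single flag, is one way to capture how far a failure has spread.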
- FIG. 13 shows an operation example of the communication system according to the embodiment of the present invention. The main part of the communication system (not shown) that executes the operation of FIG. 13 is similar to the communication devices 10, 10B, 10C, 10D, and 10E described above; for example, it is configured to include all the components and all the functions of the communication devices 10 to 10E. Further, the failure management unit 16 of this communication system is configured so that, by switching the operation mode, it can execute the multiple types of CRAM error coping functions provided in the communication devices 10 to 10E selectively or in combination. Each process shown in FIG. 13 is configured, for example, as a program that can be executed by a computer that controls the failure management unit 16. The operation of FIG. 13 will be described below.
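- In step S11, the aforementioned CRAM error detection unit 12c identifies whether or not a 1-bit soft error has occurred in the CRAM 12b.
- When a soft error has occurred, the error notification ER1 is output from the CRAM error detector 12c, and the failure management unit 16 identifies which of the operation modes, mode 1 through mode 5, is selected.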
- When mode 1 is selected, the failure management unit 16 performs processing in step S13 to increase the failure detection sensitivity on the downstream side with the sensitivity change command CM1, as in the case of the communication device 10 shown in FIG. 1.
- The processing of step S13 corresponds to a first process in which the failure management unit 16 temporarily increases the sensitivity of failure detection in the downstream communication device 14.
- step S14 corresponds to a second process in which the failure management unit 16 instructs the upstream communication device 11 existing upstream of the preprocessing unit 13 to resend the signal corresponding to the failure.
- When mode 3 is selected, the failure management unit 16 performs diagnosis with a test signal by outputting a signal discard command CM31, a test transmission request CM32, and a retransmission request CM33, as in the case of the communication device 10C shown in FIG. 5. After that, processing is performed in step S15 so that the upstream communication device 11C resends the original signal.
- the process of step S15 corresponds to a third process in which the failure management unit 16 instructs the upstream communication device 11 to transmit a known test signal and diagnoses the processing result of the downstream communication unit for the test signal.
- Step S16 corresponds to a fourth process in which the failure management unit 16 selects one normal path from the duplicated paths in the preprocessing unit 13.
- When mode 5 is selected, the failure management unit 16 records the diagnosis result for the test signal, as in the case of the communication device 10E shown in FIG. 12, and, if there is a problem with the diagnosis result, outputs the restart command CM5 in step S17 so that the operation of the system is restored.
- mode 1 to mode 5 are used to cope with the occurrence of a CRAM error, but multiple processing of modes 1 to 5 can be combined and executed simultaneously.
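The mode-based handling can be sketched as a dispatch table. This is a hedged illustration: the handler names are hypothetical labels, the pairing of mode 2 with step S14 and mode 4 with step S16 is inferred from the numbering of the second and fourth processes, and running combined modes in ascending order is an assumption.

```python
# Hypothetical labels for the action taken in each mode.
HANDLERS = {
    1: "raise_detection_sensitivity",             # step S13, command CM1
    2: "discard_and_retransmit",                  # step S14 (assumed pairing)
    3: "diagnose_with_test_signal_then_resend",   # step S15, CM31/CM32/CM33
    4: "select_normal_path",                      # step S16 (assumed pairing)
    5: "record_diagnosis_and_restart_if_needed",  # step S17, command CM5
}


def on_error_notification(enabled_modes):
    """Return the actions to run for a CRAM error, given the enabled modes."""
    return [HANDLERS[mode] for mode in sorted(enabled_modes) if mode in HANDLERS]
```

For example, enabling modes 1 and 4 together yields the sensitivity increase followed by the path selection.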
- In each of the communication devices 10 to 10E described above, the failure management unit 16 takes appropriate measures in response to a soft error in the CRAM 12b detected by the CRAM error detection unit 12c, so failures on the downstream communication device 14 side can be prevented before they occur, or the device can be restored to a normal state. Therefore, it is possible to suppress the influence on service provision to users using the communication device 10.
- The failure management unit 16, which corresponds to the sensitivity adjustment instructing unit, responds to the occurrence of a CRAM error on the upstream side by at least temporarily raising the failure detection sensitivity above its steady-state level. This makes it possible to detect the occurrence of a failure at an early stage and to keep the influence of the failure from spreading to the downstream side.
- The failure management unit 16B, which corresponds to the discard instruction unit and the retransmission request unit, instructs the discarding of the signal corresponding to the error and requests the upstream communication device 11 to retransmit the corresponding signal. Therefore, a temporary error that occurs in the logic structure of the variable logic circuit section 12a due to the CRAM error can be kept from affecting the downstream side communication device 14.
- The failure management unit 16C, acting as the discard instruction unit, instructs the discarding of the signal corresponding to the CRAM error with the signal discard command CM31. Acting as the test signal request unit, the failure management unit 16C requests transmission of a known test signal with the test transmission request CM32, and when a correct processing result is obtained for the test signal, it requests retransmission with the retransmission request CM33. Therefore, the original communication signal SG1 is retransmitted after confirming that the error in the logic structure of the variable logic circuit section 12a has been corrected, which helps improve reliability.
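The discard, test, and retransmit ordering can be sketched as follows. The function `recover`, its callable parameters, and the `max_attempts` retry limit are hypothetical; the source does not say whether or how often the test is repeated, so the loop is an assumption made only to keep the sketch complete.

```python
def recover(send_test, process, expected, max_attempts=3):
    """Return the ordered commands issued while recovering from a CRAM error.

    send_test: callable returning the known test signal (upstream side).
    process:   callable applying the preprocessing to a signal.
    expected:  known-correct processing result for the test signal.
    """
    commands = ["CM31"]                        # discard the affected signal
    for _ in range(max_attempts):
        commands.append("CM32")                # request the known test signal
        if process(send_test()) == expected:   # logic structure looks repaired
            commands.append("CM33")            # request retransmission
            break
    return commands


# Hypothetical run: the first test still fails, the second one passes.
results = iter([b"bad", b"ok"])
commands = recover(lambda: b"test", lambda _signal: next(results), b"ok")
```

The retransmission request is never issued before a test has passed, which is the property the paragraph above relies on.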
- the signal selector 18 can select an error-free signal from the two communication signals SG21 and SG22 and output it in a short period of time, which helps improve the response speed of the communication device 10D.
- The CRAM error detection unit 12c detects an error in the CRAM that determines the logic structure inside the FPGA, and in response the failure management units 16 to 16E execute at least one of investigating, suppressing, and recovering from failures occurring in the downstream communication device 14 due to the error in the CRAM. Therefore, when an erroneous communication signal SG2 is output due to a CRAM error that has occurred in the FPGA, failures that occur on the downstream communication device 14 side can be dealt with appropriately.
- A communication device of the present invention includes one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units,
- the communication device comprising: a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device; and a downstream failure processing unit that, in response to occurrence of an upstream error detected by the CRAM error detection unit, executes at least one process of investigating, suppressing, and recovering from a failure that occurs in the downstream communication unit due to the upstream error.
- The downstream failure processing unit includes: a failure detection unit that detects a failure in the downstream communication unit; and a sensitivity adjustment instructing unit that, in response to the occurrence of the upstream error, at least temporarily raises the failure detection sensitivity of the failure detection unit above its steady-state level.
- In response to the occurrence of a CRAM error on the upstream side, the sensitivity adjustment instructing unit at least temporarily raises the failure detection sensitivity above its steady-state level, so failures can be detected at an early stage and the influence of the failure can be kept from spreading to the downstream side.
- The downstream failure processing unit includes: a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error; and a retransmission request unit that instructs an upstream communication unit existing upstream of the preprocessing unit to retransmit the corresponding signal.
- The discard instruction unit instructs the discarding of the signal corresponding to the error, and the retransmission request unit requests the upstream communication unit to retransmit the corresponding signal. Therefore, a temporary error in the logic structure of the programmable device due to the CRAM error can be prevented from affecting the downstream side communication unit.
- The communication device according to (1) or (2) above, wherein the downstream failure processing unit includes: a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error; a test signal request unit that, in response to occurrence of the upstream error, instructs an upstream communication unit existing upstream of the preprocessing unit to transmit a known test signal; a test signal diagnosis unit that identifies whether or not the downstream side communication unit has obtained a correct processing result for the test signal transmitted from the upstream side communication unit; and a retransmission request unit that instructs the upstream communication unit to retransmit the discarded signal after the downstream communication unit obtains a correct processing result for the test signal.
- The discard instruction unit instructs discarding of the signal corresponding to the CRAM error, the test signal request unit requests transmission of a known test signal, and if a correct processing result is obtained for the test signal, the retransmission request unit requests retransmission. Therefore, the original communication signal is resent after confirming that the error in the logic structure of the programmable device has been corrected, which helps improve reliability.
- The communication device according to any one of (1) to (4) above, wherein the preprocessing unit includes a first programmable device circuit and a second programmable device circuit connected in parallel to a signal path, and
- the downstream failure processing unit includes a selection instruction unit that, in response to the occurrence of the upstream error, instructs the downstream communication unit to select whichever of the signal output by the first programmable device circuit and the signal output by the second programmable device circuit is not related to the error.
- In the communication device having the configuration (5) above, two systems, the first programmable device circuit and the second programmable device circuit, need to be prepared in advance, but there is no need to wait for errors in the logic structure to be repaired. In other words, an error-free signal can be selected from the two communication signals and output in a short period of time according to an instruction from the selection instruction unit, which is useful for improving the response speed of the communication device.
- A communication failure management method of the present invention is a method for managing a failure in a communication device that includes one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units,
- the method comprising: detecting an error in a CRAM that determines the logic structure inside the programmable device; and, when an error is detected in the CRAM, executing at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM.
- According to the communication failure management method of the present invention, when an error occurs in the logic structure inside the programmable device due to a soft error in the CRAM, failures in the functions of the downstream communication device caused by an erroneous communication signal output from the programmable device can be suppressed.
- The failure management program of the present invention is a computer-executable program for managing failures in a communication device that includes one or more downstream side communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals on the upstream side of the downstream side communication unit,
- the program being configured to include: a procedure for detecting errors in a CRAM that determines the logic structure inside the programmable device; and a procedure for executing, when an error is detected in the CRAM, at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM.
- By executing the failure management program of the present invention on a predetermined computer, when an error occurs in the logic structure inside the programmable device due to a soft error in the CRAM, failures in the functions of the downstream communication device caused by an erroneous communication signal output from the programmable device can be suppressed.
- A communication system of the present invention includes one or more downstream communication units that receive and process signals input from an upstream side, a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, and a failure management unit that manages failures in the downstream communication unit,
- the communication system comprising a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device, wherein the failure management unit can execute one or more of: a first process of temporarily increasing the sensitivity of failure detection in the downstream communication unit; a second process of instructing the upstream communication unit existing upstream of the preprocessing unit to resend a signal corresponding to the failure; a third process of instructing the upstream communication unit to transmit a known test signal and diagnosing the processing result of the downstream communication unit for the test signal; and a fourth process of selecting one normal path from the duplicated paths in the preprocessing unit, and
- the failure management unit is configured to execute one or more of the first process, the second process, the third process, and the fourth process in response to occurrence of an upstream error detected by the CRAM error detection unit.
- According to this communication system, failures in the functions of the downstream communication device caused by an erroneous communication signal output from the programmable device can be suppressed.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Electromagnetism (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Detection And Prevention Of Errors In Transmission (AREA)
Abstract
Description
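Provided are a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device, and
a downstream failure processing unit that, in response to occurrence of an upstream error detected by the CRAM error detection unit, executes at least one process of investigating, suppressing, and recovering from a failure that occurs in the downstream communication unit due to the upstream error.
Other inventions are described in detail in the embodiments below.
<Configuration example 1 of the communication device>
-<Description of the basic configuration>
FIG. 1 shows configuration example 1 of the communication device according to the embodiment of the present invention.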
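An important point to consider here is that a soft error may occur inside the CRAM 12b. That is, as semiconductors are miniaturized, the probability of soft errors caused by cosmic rays is increasing. A soft error can therefore produce bits inside the CRAM 12b that differ from the original pre-programmed data, which may introduce an error into the logic circuit configuration of the variable logic circuit unit 12a.
The communication device 10 shown in FIG. 1 includes a failure detection unit 15 and a failure management unit 16 as countermeasures against 1-bit soft errors in the CRAM 12b. In response to the occurrence of an upstream error detected by the CRAM error detection unit 12c, the failure detection unit 15 and the failure management unit 16 operate as a downstream failure processing unit that executes at least one process of investigating, suppressing, and recovering from a failure that occurs in the downstream communication device 14 due to this upstream error. The CRAM error detection unit 12c is configured to transmit an error notification ER1 when a 1-bit soft error occurs. On receiving the error notification ER1 from the CRAM error detection unit 12c, the failure management unit 16 transmits a sensitivity change command CM1 to the failure detection unit 15. In accordance with the sensitivity change command CM1 from the failure management unit 16, the failure detection unit 15 temporarily raises the sensitivity of failure detection in the downstream communication device 14 above its steady-state level. The failure management unit 16 thus operates as a sensitivity adjustment instructing unit that, in response to the occurrence of the upstream error, at least temporarily raises the failure detection sensitivity of the failure detection unit 15 above its steady-state level.
FIG. 2 shows a main operation example of the communication device 10 of FIG. 1.
When a soft error occurs in the CRAM 12b, the CRAM error detection unit 12c detects it and sends the error notification ER1 as shown in FIG. 2. The failure management unit 16 transmits the sensitivity change command CM1 in accordance with the received error notification ER1. The failure detection unit 15 raises the failure detection sensitivity for the downstream communication device 14 in accordance with the received sensitivity change command CM1.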
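The ER1, CM1, and sensitivity-increase sequence can be sketched as a detector whose threshold tightens for a limited time. Everything concrete here is an assumption made for illustration: the class `FaultDetector`, the numeric thresholds, and the idea of counting the raised period in checks are hypothetical and are not taken from the patent text.

```python
class FaultDetector:
    """Toy stand-in for the failure detection unit 15 (hypothetical thresholds)."""

    def __init__(self, normal_threshold=10, raised_threshold=3):
        self.normal_threshold = normal_threshold
        self.raised_threshold = raised_threshold
        self.raised_checks_left = 0

    def on_sensitivity_change(self, checks):
        """Handle CM1: use the stricter threshold for the next `checks` checks."""
        self.raised_checks_left = checks

    def check(self, error_count):
        """Return True if `error_count` is treated as a failure on this check."""
        if self.raised_checks_left > 0:
            self.raised_checks_left -= 1
            return error_count >= self.raised_threshold
        return error_count >= self.normal_threshold


detector = FaultDetector()
before = detector.check(5)          # steady-state sensitivity
detector.on_sensitivity_change(2)   # ER1 arrived, so CM1 was issued
during = [detector.check(5), detector.check(5)]
after = detector.check(5)           # sensitivity has returned to normal
```

A lower threshold flags smaller anomalies as failures, which is one simple way to model "higher sensitivity" during the period right after a CRAM error.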
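FIG. 3 shows a communication device 10B of configuration example 2 according to the embodiment of the present invention. FIG. 4 shows a main operation example of the communication device 10B of FIG. 3. The communication device 10B of FIG. 3 is a modification of the communication device 10 of FIG. 1, and the same components are indicated by the same reference numerals in FIGS. 1 and 3.
FIG. 5 shows a communication device 10C of configuration example 3 according to the embodiment of the present invention. FIGS. 6 and 7 show examples of the operation timing and the operation procedure, respectively, of the communication device 10C of FIG. 5. The communication device 10C shown in FIG. 5 is a modification of the communication device 10B of FIG. 3, and the same components are indicated by the same reference numerals in FIGS. 3 and 5.
FIG. 8 shows a communication device 10D of configuration example 4 according to the embodiment of the present invention. The communication device 10D shown in FIG. 8 is a modification of the communication device 10 of FIG. 1, and the same components are indicated by the same reference numerals in FIGS. 1 and 8.
FIGS. 9, 10, and 11 show operation example 1, operation example 2, and operation example 3, respectively, of the communication device 10D of FIG. 8.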
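The operation example shown in FIG. 9 represents an operation by which the signal selection unit 18 of the communication device 10D selects one of the two systems of communication signals SG21 and SG22. When no soft error has occurred in the CRAM 12b, the preprocessing units 13 of the two FPGA boards 12-1 and 12-2 perform the same processing, so no problem arises whichever of the two communication signals SG21 and SG22 the signal selection unit 18 selects. In this embodiment, the signal selection unit 18 selects one communication signal, SG21, in the initial state.
The operation example shown in FIG. 10 represents an operation by which the signal selection unit 18 of the communication device 10D selects one of the two systems of communication signals SG21 and SG22. When no soft error has occurred in the CRAM 12b, the preprocessing units 13 of the two FPGA boards 12-1 and 12-2 perform the same processing, so no problem arises whichever of the two communication signals SG21 and SG22 the signal selection unit 18 selects. In this embodiment, the signal selection unit 18 selects one communication signal, SG21, in the initial state and in the steady state.
The operation example shown in FIG. 11 represents an operation by which the signal selection unit 18 of the communication device 10D selects one of the two systems of communication signals SG21 and SG22. In the operation example of FIG. 11, control focuses on whether the communication signals SG21 and SG22, which are the processing results of the preprocessing units 13 in the two systems of FPGA boards 12-1 and 12-2, match or do not match.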
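A state in which the CRAM error detection unit 12c does not detect a CRAM error in the FPGA board 12-1 in step S34 means that a CRAM error has occurred in the other FPGA board 12-2. Therefore, in step S35, the failure management unit 16D outputs the selection control command CM4 and controls the signal selection unit 18 so that it selects the communication signal SG21, which is the processing result of the FPGA board 12-1 in which no CRAM error has occurred. In this case, the information of the communication signal SG22 is discarded by the signal selection unit 18.
In other words, when the operation shown in FIG. 11 is executed, the function of the CRAM error detection unit 12c does not need to be mounted on the standby system side, so the equipment cost of the communication device 10D can be reduced.
FIG. 12 shows a communication device 10E of configuration example 5 according to the embodiment of the present invention. The communication device 10E shown in FIG. 12 is a modification of the communication device 10C of FIG. 5, and the same components are indicated by the same reference numerals in FIGS. 5 and 12.
FIG. 13 shows an operation example of the communication system according to the embodiment of the present invention. The main part of the communication system (not shown) that executes the operation of FIG. 13 is similar to the communication devices 10, 10B, 10C, 10D, and 10E described above; for example, it is configured to include all the components and all the functions of the communication devices 10 to 10E. The failure management unit 16 of this communication system is configured so that, by switching the operation mode, it can execute the multiple types of CRAM error coping functions provided in the communication devices 10 to 10E selectively or in combination. Each process shown in FIG. 13 is configured, for example, as a program that can be executed by a computer that controls the failure management unit 16. The operation of FIG. 13 is described below.
In each of the communication devices 10 to 10E described above, the failure management unit 16 takes appropriate measures when triggered by a soft error in the CRAM 12b detected by the CRAM error detection unit 12c, so failures on the downstream communication device 14 side can be prevented before they occur or the device can be restored to a normal state. The influence on service provision to users of the communication device 10 can therefore be suppressed.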
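(1) A communication device of the present invention includes one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, the communication device comprising:
a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device; and
a downstream failure processing unit that, in response to occurrence of an upstream error detected by the CRAM error detection unit, executes at least one process of investigating, suppressing, and recovering from a failure that occurs in the downstream communication unit due to the upstream error.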
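The communication device according to (1) above, including:
a failure detection unit that detects a failure in the downstream communication unit; and
a sensitivity adjustment instructing unit that, in response to the occurrence of the upstream error, at least temporarily raises the failure detection sensitivity of the failure detection unit above its steady-state level.
The communication device according to (1) or (2) above, including:
a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error; and
a retransmission request unit that instructs an upstream communication unit existing upstream of the preprocessing unit to retransmit the corresponding signal.
The communication device according to (1) or (2) above, including:
a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error;
a test signal request unit that, in response to the occurrence of the upstream error, instructs an upstream communication unit existing upstream of the preprocessing unit to transmit a known test signal;
a test signal diagnosis unit that identifies whether or not the downstream communication unit has obtained a correct processing result for the test signal transmitted from the upstream communication unit; and
a retransmission request unit that instructs the upstream communication unit to retransmit the discarded signal after the downstream communication unit has obtained a correct processing result for the test signal.
The communication device according to any one of (1) to (4) above, wherein the downstream failure processing unit includes a selection instruction unit that, in response to the occurrence of the upstream error, instructs the downstream communication unit to select whichever of the signal output by the first programmable device circuit and the signal output by the second programmable device circuit is not related to the error.
An error in the CRAM that determines the logical structure inside the programmable device is detected, and
when an error is detected in the CRAM, at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM is executed.
The program is configured to include a procedure for detecting an error in the CRAM that determines the logical structure inside the programmable device, and
a procedure for executing, when an error is detected in the CRAM, at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM.
A CRAM error detection unit that detects an error in the CRAM that determines the logical structure inside the programmable device is provided,
the failure management unit can execute one or more of: a first process of temporarily raising the sensitivity of failure detection in the downstream communication unit; a second process of instructing an upstream communication unit existing upstream of the preprocessing unit to retransmit a signal corresponding to the failure; a third process of instructing the upstream communication unit to transmit a known test signal and diagnosing the processing result of the downstream communication unit for the test signal; and a fourth process of selecting one normal path from the duplicated paths in the preprocessing unit, and
the failure management unit is configured to execute one or more of the first process, the second process, the third process, and the fourth process in response to occurrence of an upstream error detected by the CRAM error detection unit.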
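11,11C Upstream communication device
12,12-1,12-2 FPGA board
12a Variable logic circuit unit
12b CRAM
12c CRAM error detection unit
13 Preprocessing unit
14 Downstream communication device
14a Downstream communication device main body
14b Signal holding unit
15 Failure detection unit
16,16B,16C,16D,16E Failure management unit
17 Test signal diagnosis unit
18 Signal selection unit
19 Signal holding unit
20 Recording device
ER1 Error notification
CM1 Sensitivity change command
CM21,CM31 Signal discard command
CM22,CM33 Retransmission request
CM32 Test transmission request
CM4 Selection control command
CM5 Restart command
NO3 Diagnosis result notification
SG1,SG2,SG21,SG22 Communication signal
SG1x Test signal
T1 Control period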
Claims (8)
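- A communication device including one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, the communication device comprising:
a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device; and
a downstream failure processing unit that, in response to occurrence of an upstream error detected by the CRAM error detection unit, executes at least one process of investigating, suppressing, and recovering from a failure that occurs in the downstream communication unit due to the upstream error.
- The communication device according to claim 1, wherein the downstream failure processing unit includes:
a failure detection unit that detects a failure in the downstream communication unit; and
a sensitivity adjustment instructing unit that, in response to the occurrence of the upstream error, at least temporarily raises the failure detection sensitivity of the failure detection unit above its steady-state level.
- The communication device according to claim 1 or claim 2, wherein the downstream failure processing unit includes:
a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error; and
a retransmission request unit that instructs an upstream communication unit existing upstream of the preprocessing unit to retransmit the corresponding signal.
- The communication device according to claim 1 or claim 2, wherein the downstream failure processing unit includes:
a discard instruction unit for having the downstream communication unit discard the corresponding signal in response to the occurrence of the upstream error;
a test signal request unit that, in response to the occurrence of the upstream error, instructs an upstream communication unit existing upstream of the preprocessing unit to transmit a known test signal;
a test signal diagnosis unit that identifies whether or not the downstream communication unit has obtained a correct processing result for the test signal transmitted from the upstream communication unit; and
a retransmission request unit that instructs the upstream communication unit to retransmit the discarded signal after the downstream communication unit has obtained a correct processing result for the test signal.
- The communication device according to any one of claims 1 to 4, wherein the preprocessing unit includes a first programmable device circuit and a second programmable device circuit connected in parallel to a signal path, and
the downstream failure processing unit includes a selection instruction unit that, in response to the occurrence of the upstream error, instructs the downstream communication unit to select whichever of the signal output by the first programmable device circuit and the signal output by the second programmable device circuit is not related to the error.
- A communication failure management method for managing a failure in a communication device including one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, the method comprising:
detecting an error in a CRAM that determines the logical structure inside the programmable device; and
executing, when an error is detected in the CRAM, at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM.
- A computer-executable failure management program for managing a failure in a communication device including one or more downstream communication units that receive and process signals input from an upstream side, and a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, the program including:
a procedure for detecting an error in a CRAM that determines the logical structure inside the programmable device; and
a procedure for executing, when an error is detected in the CRAM, at least one process of investigating, suppressing, and recovering from a failure occurring in the downstream communication unit due to the error in the CRAM.
- A communication system including one or more downstream communication units that receive and process signals input from an upstream side, a preprocessing unit having a programmable device that processes signals upstream of the downstream communication units, and a failure management unit that manages failures in the downstream communication unit, the communication system comprising:
a CRAM error detection unit that detects an error in a CRAM that determines the logical structure inside the programmable device,
wherein the failure management unit can execute one or more of: a first process of temporarily raising the sensitivity of failure detection in the downstream communication unit; a second process of instructing an upstream communication unit existing upstream of the preprocessing unit to retransmit a signal corresponding to the failure; a third process of instructing the upstream communication unit to transmit a known test signal and diagnosing the processing result of the downstream communication unit for the test signal; and a fourth process of selecting one normal path from the duplicated paths in the preprocessing unit, and
the failure management unit executes one or more of the first process, the second process, the third process, and the fourth process in response to occurrence of an upstream error detected by the CRAM error detection unit.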
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/005657 WO2022176006A1 (ja) | 2021-02-16 | 2021-02-16 | 通信装置、通信故障管理方法、故障管理プログラム、及び通信システム |
US18/276,501 US20240126628A1 (en) | 2021-02-16 | 2021-02-16 | Communication device, communication failure management method, failure management program, and communication system |
JP2023500134A JPWO2022176006A1 (ja) | 2021-02-16 | 2021-02-16 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/005657 WO2022176006A1 (ja) | 2021-02-16 | 2021-02-16 | 通信装置、通信故障管理方法、故障管理プログラム、及び通信システム |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022176006A1 true WO2022176006A1 (ja) | 2022-08-25 |
Family
ID=82931252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/005657 WO2022176006A1 (ja) | 2021-02-16 | 2021-02-16 | 通信装置、通信故障管理方法、故障管理プログラム、及び通信システム |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240126628A1 (ja) |
JP (1) | JPWO2022176006A1 (ja) |
WO (1) | WO2022176006A1 (ja) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016052074A (ja) * | 2014-09-02 | 2016-04-11 | 株式会社日立製作所 | Communication device |
JP2018181206A (ja) * | 2017-04-20 | 2018-11-15 | 日本電気株式会社 | Data processing device, data processing method, and program |
JP2021015357A (ja) * | 2019-07-10 | 2021-02-12 | 株式会社日立製作所 | Computer system, control method, and program |
2021
- 2021-02-16 WO PCT/JP2021/005657 (WO2022176006A1, ja): active, Application Filing
- 2021-02-16 JP JP2023500134A (JPWO2022176006A1, ja): active, Pending
- 2021-02-16 US US18/276,501 (US20240126628A1, en): active, Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117648287A (zh) * | 2024-01-30 | 2024-03-05 | 山东云海国创云计算装备产业创新中心有限公司 | On-chip data processing system, method, server, and electronic device |
CN117648287B (zh) * | 2024-01-30 | 2024-05-03 | 山东云海国创云计算装备产业创新中心有限公司 | On-chip data processing system, method, server, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
JPWO2022176006A1 (ja) | 2022-08-25 |
US20240126628A1 (en) | 2024-04-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7362697B2 (en) | Self-healing chip-to-chip interface | |
CN107276710B (zh) | Time-triggered Ethernet fault diagnosis method based on time synchronization state monitoring | |
WO2022176006A1 (ja) | Communication device, communication failure management method, failure management program, and communication system | |
US8914683B2 (en) | Repairing high-speed serial links | |
US20100050062A1 (en) | Sending device, receiving device, communication control device, communication system, and communication control method | |
US20090100288A1 (en) | Fast software fault detection and notification to a backup unit | |
US20140298076A1 (en) | Processing apparatus, recording medium storing processing program, and processing method | |
US7593323B2 (en) | Apparatus and methods for managing nodes on a fault tolerant network | |
KR101566640B1 (ko) | Redundant CAN communication apparatus and method capable of coping with arbitrary communication network errors, and recording medium for performing the method | |
US20060274646A1 (en) | Method and apparatus for managing network connection | |
US8111625B2 (en) | Method for detecting a message interface fault in a communication device | |
JP2016052074A (ja) | Communication device | |
US7366952B2 (en) | Interconnect condition detection using test pattern in idle packets | |
WO2017166064A1 (zh) | Service fault handling method, apparatus, and device | |
JP2009075719A (ja) | Redundant configuration device and self-diagnosis method therefor | |
EP3316135B1 (en) | Control system | |
JP2016213979A (ja) | Protection control device and protection control system | |
Wunderlich et al. | Multi-layer test and diagnosis for dependable NoCs | |
Nambinina et al. | Adaptive Time-Triggered Network-on-Chip Architecture: Enhancing Safety | |
JP2018148421A (ja) | Network monitoring device, network monitoring system, network monitoring method, and program | |
JP4623001B2 (ja) | Fault isolation system, fault isolation method, and program | |
WO2024056184A1 (en) | Ethernet device with safety features at the physical layer and method for a bi-directional data transfer between two ethernet devices | |
US20060107109A1 (en) | Communication processing apparatus and method and program for diagnosing the same | |
JP2023182975A (ja) | Network system and control method therefor | |
JP2022088756A (ja) | Communication device and communication method for plant control system | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21926446; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2023500134; Country of ref document: JP; Kind code of ref document: A |
 | WWE | WIPO information: entry into national phase | Ref document number: 18276501; Country of ref document: US |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 21926446; Country of ref document: EP; Kind code of ref document: A1 |