CN110989427A

CN110989427A - Fault detection and health management method for multiprocessor computer

Info

Publication number: CN110989427A
Application number: CN201911133703.0A
Authority: CN
Inventors: 窦爱萍; 封安; 吴志川; 隽鹏辉; 原晨
Original assignee: Xian Aeronautics Computing Technique Research Institute of AVIC
Current assignee: Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2020-04-10

Abstract

The invention provides a fault detection and health management method for the core computing platform of a missile-borne inertial navigation system, based on a multi-core DSP (digital signal processor). Through the processes of data acquisition and transmission, data preprocessing, feature extraction, state monitoring and health management, and by means of multi-core processor heartbeat monitoring, multi-power-supply monitoring, atmospheric pressure and ambient temperature monitoring, high-speed 1553B bus monitoring and the like, the method realizes fault detection and health management of the core computing platform.

Description

Fault detection and health management method for multiprocessor computer

Technical Field

The invention belongs to the technical field of embedded systems, and particularly relates to a fault detection and health management method of a multiprocessor computer.

Background

A missile-borne inertial navigation system is installed in a way that makes it difficult to remove, is periodically exposed to extreme temperature and humidity in service, performs few tasks over its life cycle, and is stored for long periods. Constrained by these service conditions, the core computing platform in the missile-borne inertial navigation system needs strong fault detection and health management capabilities, and the fault detection mechanism must offer a low false alarm rate, a high fault detection rate and fault isolation capability.

In European and American countries, fault detection and health management technology is applied widely and in large numbers across advanced weaponry. The effect of PHM (prognostics and health management) technology is most pronounced on the F-35 aircraft: the rate of faults that cannot be reproduced is reduced by 82 percent, maintenance manpower is reduced by 20 to 40 percent, the logistics scale is reduced by 50 percent, the sortie generation rate is improved by 25 percent, the use and support cost is reduced by more than 50 percent compared with earlier aircraft, and the service life reaches 8000 flight hours. These statistics demonstrate the important role of PHM in reducing maintenance and support cost, improving the safety, availability and integrity of weaponry, ensuring mission success and improving combat effectiveness.

A patent application by the Beijing automation control equipment research institute, 'an inertia/satellite deep combination information processing hardware platform based on multi-core DSP', addresses the requirements that an inertia/satellite deep combination navigation information processing algorithm places on data sharing, clock synchronization, real-time operation and computing capability. That invention describes only the implementation mechanism of the hardware resources and does not consider fault detection and health management within the hardware system.

The navigation resolving device based on a heterogeneous multi-core architecture, applied for by the institute of optoelectronic technology in China, addresses the high-speed data exchange mechanism and the message synchronization mechanism among the basic processing units of a multi-core system. It has an extensible and tailorable hardware structure, can adapt to various navigation requirements and processing methods, and offers good real-time performance, flexibility and reliability. However, it contains no description of fault detection and health management for the multi-core processor cores and peripheral circuits of the navigation solver.

Disclosure of Invention

The invention aims to provide a fault detection and health management method of a multiprocessor computer. The method targets fault detection and health management in a core computing platform based on a multi-core DSP in an airborne environment application, provides fault detection mechanisms for functional components such as multi-core processing, temperature and air pressure acquisition, and the high-speed serial bus, and is intended to meet the requirements of autonomous fault detection and health management of a core processing platform in an embedded environment.

The invention is realized by adopting the following technical scheme:

a fault detection and health management method of a multiprocessor computer is based on a core computing platform which is composed of a data acquisition part, a data processing part and a state monitoring and fault diagnosis part, wherein the data acquisition part comprises a plurality of sensors, a processor monitoring module and a bus interface module, and the fault detection and health management method specifically comprises the following steps:

(1) the fault detection of the core circuit of the multi-core DSP is realized by designing a multi-core heartbeat circuit;

(2) monitoring and predicting faults of a secondary power supply network;

(3) acquiring and monitoring harsh-environment parameters through sensors;

(4) and monitoring high-speed (4M)1553B bus communication.

Preferably, the multi-core DSP adopts a DSP-Q6713J/500 chip.

Preferably, in the step (1), the DSP-Q6713J/500 includes 4 independent EMIF buses, and heartbeat registers are implemented in FPGA resources and accessed through the EMIF buses. After the system is powered on and running, each independent core maintains its heartbeat register and stores the maintenance result in the corresponding computing unit. The main control unit checks the heartbeat of each core, and when the heartbeat of another unit is abnormal, the corresponding fault code is sent to the system control unit through the bus, thereby realizing fault monitoring of the multi-core processor functional circuit.

Preferably, the step (2) is used for implementing quantitative testing of the complex secondary power supply network, and judging the working state trend of the secondary power supply through data acquisition and data analysis, so as to estimate the health of the secondary power supply.

Preferably, the step (2) specifically comprises performing a full-edge scan test within the voltage range, designing voltage conditioning and a cross switch into the output network, and obtaining the amplitude state of the secondary power supply through high-speed A/D sampling. Changes in the secondary power supply output are observed through statistics of the measured values. Because the performance of a secondary power supply generally degrades gradually until failure, the output voltage of the secondary power supply is monitored through long-term data sampling and analysis.

Preferably, in the step (3), the parameters of the environment in which the system is located are acquired by high-precision sensors, and the environmental sensor data are acquired, filtered and estimated in order to judge whether the operating environment of the system is reliable.

Preferably, in the step (4), the detection of the high-speed 1553B bus is performed by using an offline mode for communication detection.

Preferably, the specific method for performing communication detection in the offline mode is as follows: the platform acts as an RT in the system, and control and test instructions are sent by the BC of the system. In the power-on BIT of the high-speed 1553B bus, a vector word command test and a data test are performed on each of channels A and B of the 1553B bus. The power-on BIT of the system is counted as passed only after the channel switching tests on both the A and B buses have passed, which ensures that both redundant channels are covered by the power-on BIT.

The invention has the technical effects that: the invention solves the problem of in-board resource fault detection and health management of navigation core resources, realizes a fault prediction mechanism for embedded equipment used in severe environment, and improves the reliability of the system.

Drawings

FIG. 1 is a schematic diagram of core computing platform fault monitoring and health management data flow.

FIG. 2 is a diagram of a core computing platform interface and functionality.

FIG. 3 is a secondary power supply network for a core computing platform.

FIG. 4 is a schematic diagram of a circuit of a platinum resistance temperature measurement RTD.

FIG. 5 is a schematic view of the principle of air pressure acquisition.

FIG. 6 is a high speed 1553B bus interface circuit.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The specific steps in which the present invention is implemented are further described with reference to fig. 1-6. The fault detection and health management method of the multiprocessor computer is successfully implemented in a navigation system.

A fault detection and health management method of a multiprocessor computer is based on a core computing platform which is composed of a data acquisition part, a data processing part and a state monitoring and fault diagnosis part, wherein the architecture of the core computing platform is shown in figure 1, the data acquisition part comprises a plurality of sensors, a processor monitoring module and a bus interface module, and the fault detection and health management method specifically comprises the following steps:

(1) the fault detection of the core circuit of the multi-core DSP is realized by designing a multi-core heartbeat circuit;

(2) monitoring and predicting faults of a secondary power supply network;

(3) acquiring and monitoring harsh-environment parameters through sensors;

(4) and monitoring high-speed (4M)1553B bus communication.

The multi-core DSP is the multi-core DSP-Q6713J/500 chip developed by the national defense science, and an SMQ2V1000 chip is selected as the programmable logic device. The processor contains 4 cores based on the C6713, and its maximum clock frequency can reach 500 MHz.

The four kernels in the system respectively realize the following work:

the first kernel realizes the management of a 1553B bus, a high-speed serial bus, a low-speed serial bus and the switching of internal and external frequency standards;

the second kernel realizes the control of the motor and the acquisition of the locking mechanism;

the third kernel realizes the self-calibration function of the user;

and the fourth kernel is used for health management and fault diagnosis decision and judgment.

In the step (1), the DSP-Q6713J/500 includes 4 independent EMIF buses, and heartbeat registers are implemented in FPGA resources and accessed through the EMIF buses. After the system is powered on and running, each independent core maintains its heartbeat register and stores the maintenance result in the corresponding computing unit. The main control unit checks the heartbeat of each core, and when the heartbeat of another unit is abnormal, the corresponding fault code is sent to the system control unit through the bus, thereby realizing fault monitoring of the multi-core processor functional circuit, as shown in fig. 2. The registers are also connected to an external test interface as discrete signals, and all 4 DSP discrete output signals are optically isolated. The signal isolation design uses the HCPL-6651 or HCPL-0631. The operating health of the multi-core processor can therefore also be obtained through external testing.
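The heartbeat scheme described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: the choice of the fourth core as the main control unit, the 16-bit counter width, the `STALL_LIMIT` threshold and the fault code layout `0xE0 | core` are all hypothetical, and the `HeartbeatRegisters` class merely stands in for the FPGA registers reached over EMIF.

```python
NUM_CORES = 4
MONITOR_CORE = 3   # assumption: the fourth core acts as the main control unit
STALL_LIMIT = 3    # assumption: unchanged polls tolerated before a fault is raised


class HeartbeatRegisters:
    """Stand-in for the FPGA heartbeat registers accessed over EMIF."""

    def __init__(self, num_cores=NUM_CORES):
        self.counters = [0] * num_cores

    def beat(self, core):
        # A healthy core periodically increments its own 16-bit counter.
        self.counters[core] = (self.counters[core] + 1) & 0xFFFF


class HeartbeatMonitor:
    """Polls the counters and flags cores whose counter has stopped changing."""

    def __init__(self, regs):
        self.regs = regs
        self.last_seen = list(regs.counters)
        self.stalled_polls = [0] * len(regs.counters)

    def poll(self):
        """Return a list of hypothetical fault codes for stalled cores."""
        faults = []
        for core, value in enumerate(self.regs.counters):
            if core == MONITOR_CORE:
                continue  # the monitor does not check itself in this sketch
            if value == self.last_seen[core]:
                self.stalled_polls[core] += 1
            else:
                self.stalled_polls[core] = 0
                self.last_seen[core] = value
            if self.stalled_polls[core] >= STALL_LIMIT:
                faults.append(0xE0 | core)  # hypothetical fault code layout
        return faults
```

In this sketch, a core that stops calling `beat` is reported once its counter has stayed unchanged for `STALL_LIMIT` consecutive polls; the returned codes would then be forwarded to the system control unit over the bus.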

In the step (2), the core computing platform carries many functions, such as various interfaces, buses and motor servo circuits, as shown in fig. 2. The secondary power supply network is correspondingly complex, and a fault in any one of the secondary power supplies may affect normal operation of the system, so monitoring of the secondary power supply network is key to the healthy operation of the system.

Power supply monitoring performs a quantitative test on the output of the secondary power supply: a full-edge scanning test is carried out within the voltage range, voltage conditioning and a cross switch are designed into the output network, and the amplitude state of the secondary power supply is obtained through high-speed A/D sampling. Changes in the secondary power supply output are observed through statistics of the measured values. Because the performance of a secondary power supply generally degrades gradually until failure, the output voltage is monitored through long-term data sampling and analysis, so that an early warning of power supply failure can be given.
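As a rough illustration of how sampled rail voltages might be turned into a health estimate, the sketch below averages the samples in windows and fits a least-squares slope to the window means. The function names, the 5 percent tolerance, the drift limit and the three status strings are assumptions made for this example; the source does not specify the statistics used.

```python
from statistics import mean


def window_means(samples, window):
    """Average consecutive, non-overlapping windows of A/D samples."""
    return [mean(samples[i:i + window])
            for i in range(0, len(samples) - window + 1, window)]


def slope(values):
    """Least-squares slope of the values against their index."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def assess_rail(samples, nominal, tolerance=0.05, drift_limit=0.002, window=8):
    """Classify one secondary rail (all thresholds are illustrative assumptions).

    FAULT   : any window mean is outside nominal +/- tolerance
    WARNING : still in tolerance, but drifting faster than drift_limit per window
    HEALTHY : otherwise
    """
    means = window_means(samples, window)
    if any(abs(m - nominal) > tolerance * nominal for m in means):
        return "FAULT"
    if len(means) >= 2 and abs(slope(means)) > drift_limit * nominal:
        return "WARNING"
    return "HEALTHY"
```

A rail that is still within tolerance but drifting steadily is reported as `WARNING`, which corresponds to the early-warning idea in the text: gradual degradation is caught before the output actually leaves its limits.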

The temperature and air pressure acquisition circuit in the step (3) not only provides compensation parameters for functional components of the system, but also monitors whether the sensor body is working within its expected environment. When the ambient temperature and air pressure are harsh, the acquisition precision of the sensor can be greatly reduced. The operating temperature and air pressure environment of the sensor are therefore monitored, and harsh conditions are reported to the system in a timely manner.
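The acquire, filter and judge sequence for an environmental channel can be sketched as follows. The median filter, the window length and the example limits are assumptions chosen for illustration; the source does not state which filter or thresholds are used.

```python
from collections import deque
from statistics import median


class EnvChannel:
    """One environmental channel with a sliding median filter and range check."""

    def __init__(self, name, low, high, window=5):
        self.name = name
        self.low = low      # assumed lower limit of the expected environment
        self.high = high    # assumed upper limit of the expected environment
        self.history = deque(maxlen=window)

    def update(self, raw):
        """Filter one raw sample and return (filtered_value, in_range)."""
        self.history.append(raw)
        filtered = median(self.history)
        return filtered, self.low <= filtered <= self.high
```

With a window of five samples, a single spurious reading does not trip the range check, while a sustained excursion does after a few samples and would then be reported to the system.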

The core computing platform uses a platinum resistance sensor for temperature acquisition. As shown in fig. 4, the temperature acquisition circuit is designed around the ADS1148, an A/D chip dedicated to high-precision temperature measurement. The ADS1148 from TI is a highly integrated 16-bit precision ADC with two identical internal constant-current sources (IDACs), which avoids introducing additional measurement error. One ADS1148 can accept 4 differential analog input channels, and it communicates with the DSP over an SPI bus. The three-wire RTD (platinum resistor) connection uses a ratiometric structure to generate the reference voltage, which improves the precision of the system. In addition, to deal with the nonlinear characteristic of the temperature sensor, the nonlinear error of the temperature measurement is corrected and compensated by least-squares polynomial fitting combined with a binary-search (dichotomy) table lookup, which further improves the temperature measurement precision of the system.
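The binary-search table lookup mentioned above can be sketched as follows. This example assumes a PT100 element and builds its table from the standard IEC 60751 relation for temperatures at or above 0 °C, as a stand-in for the fitted polynomial, which the source does not disclose. The table range, the step size and the function names are likewise assumptions.

```python
from bisect import bisect_right

R0 = 100.0      # assumption: PT100 element (100 ohm at 0 degC)
A = 3.9083e-3   # IEC 60751 coefficient, valid for t >= 0 degC
B = -5.775e-7   # IEC 60751 coefficient, valid for t >= 0 degC


def rtd_resistance(t_c):
    """Resistance of the platinum element at t_c (degC, t_c >= 0)."""
    return R0 * (1.0 + A * t_c + B * t_c * t_c)


# Lookup table in 10 degC steps from 0 to 200 degC (range is an assumption).
TABLE_T = [10.0 * i for i in range(21)]
TABLE_R = [rtd_resistance(t) for t in TABLE_T]


def resistance_to_temperature(r):
    """Binary-search the table, then interpolate linearly inside the interval."""
    if not TABLE_R[0] <= r <= TABLE_R[-1]:
        raise ValueError("resistance outside table range")
    # bisect_right finds the first table entry greater than r in O(log n).
    i = min(bisect_right(TABLE_R, r), len(TABLE_R) - 1)
    r_lo, r_hi = TABLE_R[i - 1], TABLE_R[i]
    t_lo, t_hi = TABLE_T[i - 1], TABLE_T[i]
    return t_lo + (t_hi - t_lo) * (r - r_lo) / (r_hi - r_lo)
```

The table keeps the per-sample cost to a short search and one interpolation, which suits a DSP that must also run navigation tasks; a finer table or a higher-order correction would trade memory for accuracy.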

The electrical characteristic of the atmospheric pressure analog signal interface is provisionally set to 0-5 V. For consistency of chips, interfaces and logic, the same ADC chip used in the platinum resistance temperature measuring circuit is also used to acquire the atmospheric pressure voltage: after conditioning by the FX147 at the front end, the signal is input to the ADS1148. The circuit principle is shown in figure 5.

In step (4), the high-speed 1553B bus is a communication bus of the system, the high-speed 1553B bus interface circuit is as shown in fig. 6, and all commands of the system are issued through the 1553B bus. The reliability of bus operation is directly related to the reliability of the interaction of the core processing platform and the system.

Detection of the high-speed 1553B bus in the core processing platform uses an offline mode for communication detection. That is, while offline, commands are sent and responses returned separately on channel A and channel B of the high-speed bus, and physical link layer detection is performed on the redundant A/B channels by sending the instruction on a fixed channel and returning the response on that same fixed channel, which ensures the reliability of the bus link. The platform acts as an RT in the system, and both control and test commands are issued by the BC of the system. In the power-on BIT of the high-speed 1553B bus, a vector word command test and a data test are performed on each of channels A and B of the 1553B bus. The power-on BIT of the system is counted as passed only after the channel switching tests on both the A and B buses have passed, so that both redundant channels are covered by the power-on BIT.
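The pass criterion for the power-on BIT can be sketched as a small aggregation routine. The `run_test` callback, the test names and the result layout are hypothetical; in a real RT, each test would exercise the bus interface hardware in response to commands from the BC.

```python
CHANNELS = ("A", "B")
TESTS = ("vector_word", "data")  # names chosen for this sketch


def power_on_bit(run_test):
    """Run every test on every redundant channel and aggregate the results.

    run_test(channel, test) is a hypothetical callback that returns True
    when the named test passes on the named channel.
    """
    results = {
        ch: {t: bool(run_test(ch, t)) for t in TESTS}
        for ch in CHANNELS
    }
    # The BIT passes only if both channels pass both tests, so a healthy
    # channel A cannot mask a broken channel B.
    passed = all(all(r.values()) for r in results.values())
    return {"passed": passed, "results": results}
```

Keeping the per-channel results alongside the overall verdict lets the health management core isolate which channel and which test failed.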

Claims

1. A fault detection and health management method of a multiprocessor computer is based on a core computing platform which is composed of a data acquisition part, a data processing part and a state monitoring and fault diagnosis part, wherein the data acquisition part comprises a plurality of sensors, a processor monitoring module and a bus interface module, and is characterized in that the fault detection and health management method specifically comprises the following steps:

(1) the fault detection of the core circuit of the multi-core DSP is realized by designing a multi-core heartbeat circuit;

(2) monitoring and predicting faults of a secondary power supply network;

(3) acquiring and monitoring harsh-environment parameters through sensors;

(4) and monitoring high-speed 1553B bus communication.

2. A method of fault detection and health management for a multiprocessor computer as claimed in claim 1, wherein: the multi-core DSP adopts a DSP-Q6713J/500 chip.

3. A fault detection and health management method of a multiprocessor computer as claimed in claim 1 or 2, characterized in that: in the step (1), the DSP-Q6713J/500 chip comprises 4 independent EMIF buses, a heartbeat register is expanded by utilizing FPGA resources through the EMIF buses, each independent core maintains the heartbeat register after the system is powered on and runs, maintenance results are stored in the corresponding computing unit, the heartbeat of each core is detected by the main control unit, and when the heartbeat of other units is abnormal, the corresponding fault code is sent to the system control unit through the buses, so that fault monitoring of the multi-core processor functional circuit is realized.

4. A method of fault detection and health management for a multiprocessor computer as claimed in claim 1, wherein: the step (2) is used for realizing quantitative testing of the complex secondary power supply network, and judging the working state trend of the secondary power supply through data acquisition and data analysis so as to estimate the health of the secondary power supply.

5. A fault detection and health management method of a multiprocessor computer as claimed in claim 1 or 4, characterized in that: specifically, the step (2) is to perform a full-edge scanning test within a voltage range, design a voltage conditioning and a cross switch in an output network, obtain the amplitude state of the secondary power supply through high-speed A/D sampling, observe the change condition of the secondary power supply output through a statistical value, and monitor the secondary power supply output voltage through long-time data sampling and analysis because the performance degradation of the secondary power supply is gradual change until failure.

6. A method of fault detection and health management for a multiprocessor computer as claimed in claim 1, wherein: in the step (3), the environmental parameters of the system are acquired by adopting a high-precision sensor, and the environmental sensor data are acquired, filtered and estimated so as to judge the reliability of the operating environment of the system.

7. A method of fault detection and health management for a multiprocessor computer as claimed in claim 1, wherein: in the step (4), the detection of the high-speed 1553B bus is performed by adopting an offline mode for communication detection.

8. A method of fault detection and health management for a multiprocessor computer as claimed in claim 7, wherein: the specific method for carrying out communication detection in the offline mode is that the platform acts as an RT in the system, control and test instructions are sent out by the BC of the system, a vector word command test and a data test are respectively carried out on channels A and B of the 1553B bus in the power-on BIT of the high-speed 1553B bus, and the power-on BIT of the system is counted as passed only after the channel switching tests on both the A and B buses have passed, so that both redundant channels are guaranteed to be covered by the power-on BIT.