CN115686890A - Processor fault early warning method, system, electronic equipment and medium - Google Patents

Info

Publication number
CN115686890A
Authority
CN
China
Prior art keywords
register
target data
processor
early warning
polling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211181971.1A
Other languages
Chinese (zh)
Inventor
王然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211181971.1A
Publication of CN115686890A

Abstract

The invention provides a processor fault early warning method, system, electronic device and medium, and belongs to the technical field of processor fault early warning. The method comprises: periodically polling and reading the registers in a register matrix according to a preset polling sequence to obtain the target data stored in the registers; determining, by a validity judgment, which target data in the registers are valid, and storing them as valid target data; and diagnosing and analyzing the valid target data to generate corresponding diagnosis information, generating a corresponding early warning log based on the diagnosis information, and feeding the log back to a user. The method aims to give an early warning and respond before the processor actually fails, so as to avoid the serious loss caused by responding only after the failure.

Description

Processor fault early warning method, system, electronic equipment and medium
Technical Field
The invention relates to the technical field of processor fault early warning, in particular to a method, a system, electronic equipment and a medium for early warning of processor faults.
Background
Among server components, the CPU is the core component on which the server runs, and every service executed by the server depends on it. A server usually has 1 to 2 CPUs installed, and some servers have 4 or more; multiple CPUs cooperate to process user services. When a CPU fails, serious events such as server downtime inevitably follow, and user services must be stopped for troubleshooting and maintenance, which causes great loss to users.
Disclosure of Invention
In view of the above, the present invention provides a method, a system, an electronic device and a medium for early warning of processor failure. The aim is to issue an early warning before a functional module of the processor actually fails, and to act on that warning in time, so as to avoid the serious loss caused by responding only after the fault has occurred.
In a first aspect of an embodiment of the present invention, a method for early warning of a processor fault is provided, where the method includes:
performing periodic polling reading on a register in a register matrix according to a preset polling sequence to obtain target data stored in the register;
determining, by a validity judgment, which target data in the register are valid, and storing them as valid target data;
and generating corresponding diagnosis information by diagnosing and analyzing the effective target data, and generating a corresponding early warning log based on the diagnosis information and feeding the early warning log back to a user.
Optionally, the register matrix includes a register capable of performing fault early warning.
Optionally, in the process of performing periodic polling reading on the register in the register matrix according to a preset polling sequence to obtain the target data stored in the register, the method further includes:
recording, in real time, the register currently being read by polling;
and after the processor restarts due to a fault, resuming periodic polling reading of each register in the register matrix in the polling sequence, starting from the register recorded when the fault occurred, to obtain the target data stored in each register.
Optionally, the determining, according to the validity judgment, target data with validity in the register as valid target data and storing the valid target data includes:
determining a command return code for reading the register, and determining a value of a target byte bit in target data of the register;
under the condition that the values of the command return code and the target byte bit both meet set conditions, determining target data corresponding to the target byte bit in the register as valid target data with validity;
and storing the effective target data in a target storage type.
Optionally, the storing the valid target data in a target storage type includes:
determining a target storage type adapted to a processor from a plurality of preset storage types according to the structural configuration of the processor;
and storing the effective target data in a target storage type.
Optionally, the diagnosing and analyzing the effective target data to generate corresponding diagnosis information and generate a corresponding early warning log based on the diagnosis information, and feeding the early warning log back to the user includes:
determining, by a first diagnostic analysis of the valid target data, the functional module of the processor to which the valid target data correspond;
determining, according to the determined functional module, each register that is correlated with the functional module;
and performing a second diagnostic analysis on the valid target data together with the valid target data in each register correlated with the functional module to generate corresponding diagnosis information, and generating a corresponding early warning log based on the diagnosis information to feed back to a user.
Optionally, the method further comprises:
and according to the diagnosis information, suspending the processing of the processing tasks by the processor or transferring the processing tasks of the processor to a normal unit for processing.
Optionally, the method further comprises:
presetting the priority of a processing task;
and when the diagnosis information indicates that the processor is about to have a fault, transferring the processing tasks with the first priority to a normal unit for processing, and suspending the processing of the processing tasks with the second priority by the processor or continuing to process the processing tasks with the second priority by the processor.
In a second aspect of an embodiment of the present invention, there is provided a system for early warning of processor failure, the system including:
the register polling module is used for periodically polling and reading the registers in the register matrix according to a preset polling sequence to obtain target data stored in the registers;
the register monitoring module is used for determining the target data with validity in the register as valid target data according to validity judgment and storing the valid target data;
the diagnosis analysis module is used for carrying out diagnosis analysis on the effective target data to generate corresponding diagnosis information;
and the early warning reporting module is used for generating a corresponding early warning log based on the diagnosis information and feeding the early warning log back to the user.
In a third aspect of the embodiments of the present invention, there is further provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
the processor is used for implementing the steps of the processor fault early warning method in the first aspect of the invention when executing the program stored in the memory.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method for early warning of processor failure according to the first aspect of the present invention.
Compared with the prior art, the invention has the following advantages:
In the processor fault early warning method provided by the embodiment of the invention, the registers in a register matrix are periodically polled and read to obtain the target data stored in them; valid target data are determined from the target data by a validity judgment; and the valid target data are diagnosed and analyzed to generate corresponding diagnosis information, from which a corresponding early warning log is generated and fed back to a user. The data stored in the registers are thus diagnosed and analyzed before the processor fails, so that a possible failure is predicted in advance. The early warning log informs the user of faults the processor may develop, and a response can be made based on the diagnosis information before the fault occurs, which avoids the great loss caused by responding only after the processor has actually failed.
The above is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages become more apparent, specific embodiments are described below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a method for early warning of processor failure according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating a processor fault early warning system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings.
Before describing the invention, its background is explained. Before a functional module of a CPU fails, the registers of some modules record information indicating that the functional module has a functional error, is degraded, or may fail. This is low-level information inside the CPU; it is used for the CPU's internal scheduling and is not fed back to the user or the server system. Once such a condition exists, the stability of the CPU may decline after the CPU has run for some time, as its processing load increases, or as other external conditions change. If the repair functions of the CPU itself and of external components such as the BIOS are not sufficient to correct the errors and exceptions, the affected module is highly likely to cause more serious errors. For example, an exception in the IMC module may cause a memory UCE (uncorrectable error) or a memory CE storm (correctable error storm), and an exception in the UPI module may cause a CPU UCE or CE. In view of this, the present invention predicts which functional module of the CPU may fail by diagnosing and analyzing the fault-related information recorded in registers at the bottom layer of the CPU, so that a response can be made in time before the functional module actually fails, avoiding the significant loss caused by responding only after the failure occurs.
Fig. 1 is a flowchart of a method for early warning of processor failure according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step S101: performing periodic polling reading on the registers in the register matrix according to a preset polling sequence to obtain target data stored in the registers.
In the invention, the register matrix comprises registers capable of carrying out fault early warning.
In the embodiment of the invention, a register stores a large amount of data, but the invention predicts possible processor faults by diagnosing and analyzing only the data that the registers store for low-level fault early warning of the processor. That data is the target data stored in the register. The registers capable of fault early warning include the Core Register, the CHA MSR (Caching and Home Agent Model Specific Register), and the Uncore MSR (Uncore Model Specific Register).
In the embodiment of the invention, a processor has a large number of registers of different types, and the invention predicts possible processor faults by diagnosing and analyzing the target data stored in them. The register matrix is therefore built only from registers capable of fault early warning, and periodic polling reads only the registers in this matrix. This avoids polling every register of the processor for target data and improves the efficiency of predicting possible processor faults.
It should be understood that a fault the processor may have means that the processor has not currently failed; the low-level fault early warning data in the processor's registers are analyzed to judge whether the processor is likely to fail in the coming period, and if so, the possible fault is predicted in advance.
In the embodiment of the invention, the polling sequence of the registers in the register matrix is set in advance, and each register in the register matrix is periodically polled and read in that sequence to acquire the target data it stores. Periodic means that after all registers in the register matrix have been polled once in the preset polling sequence, a new round of polling reading in the same sequence begins after a preset time interval. The preset time interval may be set according to the actual application scenario and is not specifically limited.
Illustratively, the register matrix includes registers A, B, C, D, E, and F, and the preset polling sequence is register B, then register C, then register D, then register A, then register E, and then register F. Every subsequent round of periodic polling reads the registers in the register matrix in this same sequence.
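The polling scheme described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `POLL_ORDER`, `poll_cycle`, and `poll_periodically` are hypothetical, the register access is stood in for by a caller-supplied `read_register` function, and the register values are made up.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Preset polling sequence from the example; the register names are placeholders.
POLL_ORDER = ["B", "C", "D", "A", "E", "F"]


def poll_cycle(order: Sequence[str],
               read_register: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Read every register in the matrix once, in the preset order."""
    return [(name, read_register(name)) for name in order]


def poll_periodically(order: Sequence[str],
                      read_register: Callable[[str], int],
                      interval_s: float,
                      cycles: int,
                      sleep: Callable[[float], None] = time.sleep):
    """Run `cycles` polling rounds, waiting `interval_s` between rounds."""
    rounds = []
    for cycle in range(cycles):
        rounds.append(poll_cycle(order, read_register))
        if cycle < cycles - 1:
            sleep(interval_s)  # the preset time interval between rounds
    return rounds


# Usage with made-up register contents (an assumption for illustration only).
fake_registers = {"A": 0x10, "B": 0x20, "C": 0x30, "D": 0x40, "E": 0x50, "F": 0x60}
rounds = poll_periodically(POLL_ORDER, fake_registers.__getitem__,
                           interval_s=0.0, cycles=2)
print([name for name, _ in rounds[0]])  # ['B', 'C', 'D', 'A', 'E', 'F']
```

Passing the read function and the sleep function as parameters keeps the sketch independent of any particular register-access mechanism, which the text does not specify.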
Step S102: determining, by a validity judgment, which target data in the register are valid, and storing them as valid target data.
In the embodiment of the invention, after the target data of each register in the register matrix are obtained, they can be diagnosed and analyzed to predict possible processor faults. If the target data contain abnormal entries, however, a diagnosis based on them yields a wrong prediction. To avoid this and obtain a more accurate prediction, the invention judges the validity of the target data after reading them, screens out the target data that are free of abnormality (the valid target data), and stores them. Abnormal target data in the register are thereby filtered out, and only valid target data are used for diagnosis and analysis, which makes the prediction more accurate.
In the embodiment of the invention, after the target data in one register is read and the validity is judged, the next register of the register is read and the validity is judged according to the preset polling sequence until a polling cycle is completed, the valid target data in the polling cycle is diagnosed and analyzed, and then the target data in a new polling cycle is read and the validity is judged, and the diagnosis and analysis are carried out.
It should be understood that the number of valid target data in the target data stored in any register in the register matrix is equal to or less than the number of the target data, and is equal to or greater than zero. Illustratively, the number of target data stored in any register in the register matrix is N, the number of valid target data stored in any register is M, and the relationship between N and M is that N ≧ M ≧ 0.
Step S103: diagnosing and analyzing the valid target data to generate corresponding diagnosis information, generating a corresponding early warning log based on the diagnosis information, and feeding the early warning log back to a user.
In the embodiment of the invention, after the valid target data in each register of the register matrix are obtained, they are diagnosed and analyzed to generate corresponding diagnosis information, which includes the functional module of the processor that may fail and the associated fault information. A corresponding early warning log is then generated from the diagnosis information and fed back to the user, informing the user which functional module may fail so that the user can troubleshoot it before the fault occurs and avoid the resulting loss.
In the embodiment of the invention, the valid target data in the registers of the register matrix are diagnosed and analyzed register by register in the preset polling sequence. Continuing the above example, with the preset polling sequence B, C, D, A, E, F, the valid target data in register B are diagnosed and analyzed first, followed by those in register C, register D, register A, register E, and register F.
In summary, in the processor fault early warning method provided by the embodiment of the present invention, the registers in the register matrix are periodically polled and read in the preset polling sequence to obtain the target data stored in them; valid target data are determined from the target data by a validity judgment; and the valid target data are diagnosed and analyzed to generate corresponding diagnosis information, from which an early warning log is generated and fed back to the user. The possible failure of the processor is thus predicted before it happens, the user is informed which functional module may fail and can troubleshoot it in time, and a response based on the diagnosis information can be made before the processor fails, which avoids the major loss caused by responding only after the processor has actually failed.
In addition, the invention periodically polls only the registers capable of fault early warning, that is, only the registers in the constructed register matrix, which avoids polling every register for target data and improves the efficiency of predicting possible processor faults. The invention also filters out abnormal target data by judging the validity of the target data in each register and uses only valid target data for diagnosis and analysis, which makes the prediction more accurate.
In the present invention, the process of periodically polling and reading the registers in the register matrix in a preset polling sequence to obtain the target data stored in the registers further includes: recording, in real time, the register currently being read by polling; and after the processor restarts due to a fault, resuming periodic polling reading of each register in the register matrix in the polling sequence, starting from the register recorded when the fault occurred, to obtain the target data stored in each register.
In the embodiment of the invention, to ensure that early warning of a possible processor fault is timely, the invention provides another implementation. While the registers in the register matrix are periodically polled and read, the register currently being read is monitored and recorded in real time. After the processor restarts because of a fault, by warm reset or cold reset, periodic polling reading resumes in the preset polling sequence from the last recorded register. This ensures that every register in the register matrix is correctly read once in each polling cycle, and therefore that early warning is timely.
Continuing the above example, the preset polling sequence is register B, C, D, A, E, F. Suppose that in one polling cycle an AC or DC fault occurs in the processor while register D is being read; register D is then recorded as the last register polled in the current cycle. After the processor restarts, polling reading resumes from register D and continues with register A, register E, and register F. Every register in the register matrix is therefore correctly read once in that polling cycle without omission, which keeps early warning timely. Fault prediction and early warning for the cycle are then based on all the valid target data the cycle should contain, so the final prediction and early warning results are more accurate.
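One way to picture the record-and-resume behaviour is the sketch below. It is an assumption-laden illustration: the checkpoint is kept in a small JSON file (the text does not say where the record is held), the fault is simulated by raising an exception, and all function names are hypothetical.

```python
import json
import os
import tempfile

POLL_ORDER = ["B", "C", "D", "A", "E", "F"]


def save_checkpoint(path, register_name):
    """Record the register about to be read; write-then-rename avoids a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"register": register_name}, fh)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last recorded register name, or None if no record exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh).get("register")
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def resume_order(order, last_recorded):
    """Remaining registers of the interrupted cycle, starting at the recorded one."""
    if last_recorded not in order:
        return list(order)
    return list(order[order.index(last_recorded):])


def run_cycle(order, read_register, path):
    """Finish (or start) one polling cycle, recording progress as it goes."""
    results = []
    for name in resume_order(order, load_checkpoint(path)):
        save_checkpoint(path, name)
        results.append((name, read_register(name)))
    if os.path.exists(path):
        os.remove(path)  # cycle complete: the next cycle starts from the beginning
    return results


# Simulated fault while register D is being read (illustrative only).
checkpoint = os.path.join(tempfile.mkdtemp(), "poll_checkpoint.json")


def failing_read(name):
    if name == "D":
        raise RuntimeError("simulated processor fault")
    return 0


try:
    run_cycle(POLL_ORDER, failing_read, checkpoint)
except RuntimeError:
    pass

resumed = run_cycle(POLL_ORDER, lambda name: 0, checkpoint)
resumed_names = [name for name, _ in resumed]
print(resumed_names)  # ['D', 'A', 'E', 'F']
```

The record is written before each read, so a register whose read was interrupted is read again after the restart. In this sketch the values read before the fault are not persisted; in the described method they would already have been stored as valid target data.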
It should be understood that the prediction result and the early warning result are two associated results. Analyzing and judging the valid target data yields a prediction result, namely which functional module of the processor may fail and what kind of fault may occur. A corresponding early warning result is generated from the prediction result and fed back to the user, so that the user can troubleshoot the processor and take countermeasures before it fails, avoiding the major loss caused by acting only after the failure has occurred.
In the present invention, determining by a validity judgment which target data in the register are valid and storing them as valid target data includes: determining the command return code of the register read, and determining the value of a target byte bit in the target data of the register; when both the command return code and the value of the target byte bit satisfy set conditions, determining the target data corresponding to the target byte bit to be valid target data; and storing the valid target data in a target storage type.
In the embodiment of the present invention, reading the target data in a register also returns the command return code of the register access, and this code shows whether the read actually obtained the target data. The target data of the register also contain a target byte bit whose value indicates whether the target data are valid. For any register read from the register matrix, the invention therefore decides whether the target data are valid from both the command return code of the read and the value of the target byte bit. Specifically, when the command return code indicates that the read obtained the target data and the value of the target byte bit indicates that the target data are valid, both are deemed to satisfy the set conditions, and the target data corresponding to the target byte bit are determined to be valid target data. The valid target data are stored in a target storage type, and invalid target data are filtered out. After the target data of one register have been read and judged, the target data of the next register in the preset polling sequence are read and judged.
Illustratively, reading a register yields a command return code X for the read and the target data Y in the register. If the value of X indicates that the read was abnormal, that is, the read did not obtain normal target data from the register, Y is invalid target data. If the value of X indicates that the read was normal, the value of the target byte bit of Y is examined; if it indicates that the target data are valid, Y is determined to be valid target data and is stored in a target storage type.
It should be understood that the target byte bit differs between registers in the register matrix: in some registers the bit indicating whether the target data are valid is BIT7, and in others it is BIT63. The invention therefore presets a correspondence between register type and target byte bit. When the validity of a register's target data is judged, the target byte bit is looked up from this correspondence, and its value is read to judge the validity of the corresponding target data.
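The two-condition validity judgment can be sketched as below. Several details are assumptions made only for illustration: a return code of 0 is taken to mean a successful read, a set bit is taken to mean valid data, and the register type names in `VALID_BIT_BY_TYPE` are placeholders; the text specifies only that both the return code and the target bit must satisfy set conditions.

```python
# Hypothetical correspondence between register type and validity bit position.
VALID_BIT_BY_TYPE = {
    "type_with_bit7": 7,
    "type_with_bit63": 63,
}

RETURN_CODE_OK = 0  # assumed value of a successful command return code


def is_valid(register_type: str, return_code: int, value: int) -> bool:
    """Return True only if the read succeeded and the validity bit is set."""
    if return_code != RETURN_CODE_OK:
        return False  # the read did not obtain normal target data
    bit = VALID_BIT_BY_TYPE.get(register_type)
    if bit is None:
        return False  # unknown register type: no rule to judge validity
    return ((value >> bit) & 1) == 1


def filter_valid(readings):
    """Keep only valid readings from (register_type, return_code, value) tuples."""
    return [(rtype, value) for rtype, code, value in readings
            if is_valid(rtype, code, value)]


readings = [
    ("type_with_bit7", 0, 0x80),      # bit 7 set, read OK: valid
    ("type_with_bit7", 0, 0x40),      # bit 7 clear: invalid
    ("type_with_bit63", 1, 1 << 63),  # bit 63 set but read failed: invalid
    ("type_with_bit63", 0, 1 << 63),  # bit 63 set, read OK: valid
]
valid = filter_valid(readings)
```

Checking the return code first mirrors the order in the example with X and Y: the bit is only meaningful when the read itself succeeded.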
In the present invention, the storing the valid target data in a target storage type includes: determining a target storage type adapted to a processor from a plurality of preset storage types according to the structural configuration of the processor; and storing the effective target data in a target storage type.
In the embodiment of the invention, processors with different structural configurations use different data storage types, so a plurality of storage types are preset to make the early warning method applicable to processors with different structural configurations. According to the structural configuration of the processor, a target storage type suited to that configuration is determined from the plurality of preset storage types, and the determined valid target data is stored in the target storage type through the interface of that storage type. The storage types include at least files and databases in specific formats. If the data is stored in a file, one piece of valid target data comprises the time at which the register was collected, the register type and the register value; if the data is recorded in a database, one piece of valid target data comprises the recording time of the register, the register type and the register value.
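One way to picture the selection of a target storage type is a small registry of storage back ends that share a `save` interface. The sketch below is an assumption-laden illustration: the configuration names, the JSON-lines file format, the file name, and the SQLite table layout are all hypothetical; the text only states that a record holds a time, a register type and a register value.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class Record:
    timestamp: float      # collection or recording time
    register_type: str
    register_value: int


class FileStore:
    """Appends one JSON object per line (hypothetical file format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, record: Record) -> None:
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(record)) + "\n")


class DatabaseStore:
    """Stores records in SQLite; the value is kept as hex text because a
    64-bit register value with bit 63 set exceeds SQLite's signed INTEGER."""

    def __init__(self, dsn: str = ":memory:") -> None:
        self.conn = sqlite3.connect(dsn)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(timestamp REAL, register_type TEXT, register_value TEXT)"
        )

    def save(self, record: Record) -> None:
        self.conn.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (record.timestamp, record.register_type, hex(record.register_value)),
        )
        self.conn.commit()


# Hypothetical mapping from structural configuration to storage back end.
STORE_FACTORIES = {
    "config_file": lambda: FileStore("valid_target_data.jsonl"),
    "config_db": lambda: DatabaseStore(),
}


def select_store(structural_configuration: str):
    """Pick the storage type adapted to the processor configuration."""
    return STORE_FACTORIES[structural_configuration]()
```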
In the present invention, the generating of corresponding diagnosis information by performing diagnosis and analysis on the valid target data, and the generating of a corresponding early warning log based on the diagnosis information to be fed back to a user, includes: determining the functional module of the processor that corresponds to the valid target data by performing a first diagnosis analysis on the valid target data; determining each register having a correlation with the functional module according to the determined functional module; and performing a second diagnosis analysis on the valid target data together with the valid target data in each register having a correlation with the functional module to generate corresponding diagnosis information, and generating a corresponding early warning log based on the diagnosis information to feed back to a user.
In the embodiment of the present invention, the bottom layer fault early warning data related to one functional module of the processor may be stored in several different registers, the bottom layer fault early warning data in one register may relate to several functional modules of the processor, and a single piece of bottom layer fault early warning data may relate to several functional modules at the same time.
Therefore, to improve the accuracy of the prediction, the present invention provides another embodiment of the step of diagnosing and analyzing the valid target data to generate corresponding diagnosis information and generating a corresponding early warning log that is fed back to the user. A first diagnosis analysis is performed on a piece of valid target data to determine the functional module of the processor corresponding to it, that is, the functional module to which that piece of valid target data points. The valid target data corresponds to at least one functional module of the processor.
After the functional module of the processor corresponding to the valid target data is determined, the registers in the register matrix that have a correlation with the functional module are determined. A register having a correlation with the functional module is a register capable of recording bottom layer fault early warning data of that functional module.
After each register in the register matrix that has a correlation with the functional module is determined, a second diagnosis analysis is performed on the valid target data in each of those registers together with the valid target data used in the first diagnosis analysis, corresponding diagnosis information is generated, and a corresponding early warning log is generated based on the diagnosis information and fed back to the user. In other words, when predicting and warning in advance, the functional module that may fail is first determined from one piece of valid target data; all registers that store bottom layer early warning data of that functional module are then found; and finally that piece of valid target data and the valid target data in all of those registers are diagnosed and analyzed together to obtain the final diagnosis information. Because all valid target data related to the functional module is considered together, the resulting prediction and early warning are more accurate.
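The two-stage diagnosis can be sketched as a pair of lookups over a register-to-module correlation table. The table contents and register names below are invented for illustration, and the "analysis" is reduced to gathering the evidence that a real second-stage diagnosis would inspect.

```python
from typing import Dict, List, Set

# Hypothetical correlation table: which functional modules each register
# can record early warning data for. One register may map to several modules.
REGISTER_TO_MODULES: Dict[str, Set[str]] = {
    "REG_A": {"MLC"},
    "REG_B": {"MLC", "IO"},
    "REG_C": {"IO"},
}


def first_diagnosis(register_name: str) -> Set[str]:
    """Stage 1: find the functional module(s) the valid data points to."""
    return set(REGISTER_TO_MODULES.get(register_name, set()))


def related_registers(module: str) -> List[str]:
    """Find every register that records early warning data for the module."""
    return sorted(r for r, mods in REGISTER_TO_MODULES.items() if module in mods)


def second_diagnosis(
    trigger_register: str, valid_data: Dict[str, int]
) -> Dict[str, Dict[str, int]]:
    """Stage 2: per module, gather valid data from all correlated registers."""
    evidence: Dict[str, Dict[str, int]] = {}
    for module in sorted(first_diagnosis(trigger_register)):
        evidence[module] = {
            reg: valid_data[reg]
            for reg in related_registers(module)
            if reg in valid_data  # only registers that currently hold valid data
        }
    return evidence
```

With this table, valid data in `REG_B` points to both `MLC` and `IO`, so the second stage also pulls in `REG_A` and `REG_C`.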
For example, some functional modules of the CPU report the likelihood of a failure through fault early warning and classify it as either "Green" or "Yellow", where Green represents a low probability of failure and Yellow represents a high probability of failure. Taking the MLC as an example, the corresponding MC STATUS register records the likelihood of failure: a bit53 value of 1 in MC STATUS represents the green alarm level, and a bit54 value of 1 represents the yellow alarm level. According to the embodiment of the invention, a Green diagnosis result generates Warning-level early warning information, and a Yellow diagnosis result generates Critical-level early warning information; in both cases the early warning information is reported and fed back to the user.
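The bit-to-level mapping in this example can be expressed in a few lines. The bit positions and level names come from the paragraph above; the function name and the rule that Yellow takes precedence when both bits are set are assumptions made for the sketch.

```python
from typing import Optional

GREEN_BIT = 53   # bit53 == 1: green alarm level (from the description)
YELLOW_BIT = 54  # bit54 == 1: yellow alarm level (from the description)


def warning_level(mc_status: int) -> Optional[str]:
    """Map an MC STATUS value to an early warning level, or None if no alarm."""
    # Assumption: if both bits are set, report the more severe level.
    if (mc_status >> YELLOW_BIT) & 1:
        return "Critical"
    if (mc_status >> GREEN_BIT) & 1:
        return "Warning"
    return None
```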
In the present invention, the method further comprises: and according to the diagnosis information, suspending the processing of the processing tasks by the processor or transferring the processing tasks of the processor to a normal unit for processing.
In the embodiment of the invention, once analysis and diagnosis have determined which functional module of the processor may fail, corresponding diagnosis information is generated. An early warning log corresponding to the diagnosis information is then generated and reported to the user. In addition, according to the diagnosis information, the processor that may fail is controlled to suspend its processing tasks, or its processing tasks are transferred to other normal units for processing, so as to avoid the serious loss that would result if the processor actually failed.
In the present invention, the method further comprises: presetting the priority of a processing task; and when the diagnosis information indicates that the processor is about to have a fault, transferring the processing tasks with the first priority to a normal unit for processing, and suspending the processing of the processing tasks with the second priority by the processor or continuing to process the processing tasks with the second priority by the processor.
In the embodiment of the invention, the number of processors configured in a server is matched to the processing workload of the server in normal use; the server is not configured with surplus processors, since that would increase processing cost, and the amount of work each processor can handle is fixed. If, on foreseeing a possible processor fault, all of that processor's processing tasks were allocated to a normal unit, the processing load of the normal unit would increase greatly, which would raise the likelihood of the normal unit itself failing. To avoid this problem, the invention presets priorities for the processing tasks of the processor. When a processor fault is predicted but has not yet occurred, the processing tasks with the first priority are transferred to a normal unit for processing, while the processing tasks with the second priority either continue to be processed by the processor that may fail or are suspended. Because only the higher, first-priority tasks are transferred, the normal unit is not overloaded with tasks from the processor that may fail: its workload increases less, its likelihood of failure is reduced, and important processing tasks are unaffected and continue to be processed normally.
In the embodiment of the invention, the number of priority levels included in the first priority can be set according to the load capacity of the normal unit. When the load capacity of the normal unit is larger, the first priority is set to include more priority levels, so that more processing tasks of the processor that may fail can be transferred to the normal unit; when the load capacity is smaller, the first priority is set to include fewer priority levels, so that fewer processing tasks are transferred and the processing load of the normal unit does not become too large. In this way, the number of processing tasks transferred from the processor that may fail to the normal unit can be adjusted dynamically according to the processing capacity of the normal unit, which improves processor utilization. For example, suppose the priorities are divided into 4 levels. When the normal unit has a higher load capacity, the processing tasks of the 3 highest priorities can all be transferred to the normal unit; the 3 highest priorities are then collectively the first priority, and the remaining lowest priority is the second priority. When the normal unit has a low load capacity, only the processing tasks of the single highest priority can be transferred; the highest priority is then the first priority, and the remaining 3 lower priorities are the second priority.
In the embodiment of the present invention, according to the order of the priority levels, the first priority level includes at least the highest priority level, the second priority level includes at least the lowest priority level, and the first priority level and the second priority level include all preset priority levels. For example, the preset priorities include priority 1, priority 2, priority 3, and priority 4 in descending order of priority. When the first priority comprises priority 1 and priority 2, the second priority comprises priority 3 and priority 4; when the first priority includes only priority 1, the second priority includes priority 2, priority 3, and priority 4.
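The split into first and second priority can be sketched as a slice over the ordered list of preset priorities. The function names and the task representation are hypothetical; the constraint that the first priority contains at least the highest level and the second at least the lowest follows the text.

```python
from typing import Dict, List, Tuple


def split_priorities(
    priorities: List[int], first_count: int
) -> Tuple[List[int], List[int]]:
    """Split priorities (ordered highest first) into first and second groups.

    The first group must contain at least the highest priority and the
    second group at least the lowest, so 1 <= first_count <= len - 1.
    """
    if not 1 <= first_count <= len(priorities) - 1:
        raise ValueError("first_count must leave at least one level per group")
    return priorities[:first_count], priorities[first_count:]


def plan_tasks(
    tasks: Dict[str, int], first_group: List[int]
) -> Tuple[List[str], List[str]]:
    """Return (tasks to migrate to a normal unit, tasks kept or suspended)."""
    migrate = [name for name, prio in tasks.items() if prio in first_group]
    remain = [name for name, prio in tasks.items() if prio not in first_group]
    return migrate, remain
```

For instance, `split_priorities([1, 2, 3, 4], 2)` reproduces the case where priorities 1 and 2 form the first priority and priorities 3 and 4 form the second. How `first_count` is derived from the load capacity of the normal unit is left open here, since the text gives no concrete thresholds.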
Fig. 2 is a schematic diagram of a processor failure early warning system according to an embodiment of the present invention, and as shown in fig. 2, the system 200 includes:
the register polling module 201 is configured to perform periodic polling reading on a register in a register matrix according to a preset polling sequence, and obtain target data stored in the register;
the register monitoring module 202 is configured to determine, according to validity judgment, target data with validity in the register as valid target data and store the valid target data;
the diagnosis analysis module 203 is used for performing diagnosis analysis on the effective target data to generate corresponding diagnosis information;
and the early warning reporting module 204 is configured to generate a corresponding early warning log based on the diagnosis information and feed the early warning log back to the user.
In the embodiment of the present invention, the system 200 includes a register monitoring module 202, a diagnosis analysis module 203, and an early warning reporting module 204, where the register monitoring module 202 includes a register polling module 201 and a register matrix. The register monitoring module 202 runs in parallel throughout the life cycle of the server. The register polling module 201 in the register monitoring module 202 is responsible for acquiring the target data in the registers capable of fault early warning; the register monitoring module 202 judges the validity of the collected target data, determines the valid target data from it, and transmits the valid target data to the diagnosis analysis module 203. The diagnosis analysis module 203 determines the warning level of the early warning information by judging and analyzing the received valid target data, and transmits information on the functional module of the processor that may fail, together with the fault information, to the early warning reporting module 204. The early warning reporting module 204 is responsible for reporting and presenting the early warning information and generates a log of the diagnosis information according to the warning level: for a more serious warning, the log is generated immediately, transmitted to a log system, and displayed on a web page; a lower-level warning is recorded directly in the log. The register polling module 201 includes an encapsulated access method for the original interface used to access registers in the processor, in which the preset polling sequence and period for polling and reading the registers are set.
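The level-dependent behavior of the early warning reporting module can be sketched as below. The level names "Critical" and "Warning" are borrowed from the MC STATUS example earlier in the description; the in-memory log list and the `push_to_log_system` callback are hypothetical stand-ins for the log and the log system with its web page.

```python
from typing import Callable, List


def report_warning(
    level: str,
    message: str,
    log: List[str],
    push_to_log_system: Callable[[str], None],
) -> str:
    """Record a warning; forward it immediately only when it is severe."""
    entry = f"[{level}] {message}"
    log.append(entry)  # every warning is recorded in the log
    if level == "Critical":
        # Assumption: only the more serious level is pushed for display.
        push_to_log_system(entry)
    return entry
```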
Optionally, the register matrix in the register polling module 201 includes registers capable of performing fault early warning.
Optionally, the system 200 further comprises:
the real-time recording module is used for recording the polling read register in real time;
and the register polling module is used for continuously polling and reading each register in the register matrix from the register recorded when the processor fails according to the polling and reading sequence and periodicity after the processor is restarted due to failure, so as to obtain the target data stored in each register.
In this embodiment of the present invention, the system 200 further includes a real-time recording module, configured to record, in real time, the registers read by polling when periodically polling each register in the register matrix. After the processor is restarted due to a fault, the register polling module 201 acquires the registers recorded when the fault occurs from the real-time recording module according to the polling reading sequence and periodicity, and continuously performs polling reading on each register in the register matrix from the registers to acquire the target data stored in each register, so that the timeliness of early warning is ensured.
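Resuming polling after a restart can be sketched as rotating the preset polling order so that it starts at the recorded register. The function names, the `read` and `record` callbacks, and the fallback to the normal order when no record exists are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional


def polling_order(registers: List[str], resume_from: Optional[str] = None) -> List[str]:
    """Rotate the preset order so polling continues from the recorded register."""
    if resume_from is None or resume_from not in registers:
        return list(registers)  # no usable record: start from the beginning
    start = registers.index(resume_from)
    return registers[start:] + registers[:start]


def poll_once(
    registers: List[str],
    read: Callable[[str], int],
    record: Callable[[str], None],
    resume_from: Optional[str] = None,
) -> Dict[str, int]:
    """Run one polling cycle, recording each register before it is read."""
    results: Dict[str, int] = {}
    for name in polling_order(registers, resume_from):
        record(name)  # real-time record, so a crash leaves a resume point
        results[name] = read(name)
    return results
```

Recording the register before reading it means the register being read at the moment of failure is the one polled first after the restart.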
Optionally, the register monitoring module 202 includes:
the first register monitoring module is used for determining a command return code for reading the register and determining the value of a target byte bit in target data of the register;
the second register monitoring module is used for determining target data corresponding to the target byte bit in the register as valid target data with validity under the condition that the values of the command return code and the target byte bit both meet set conditions;
and the data storage module is used for storing the effective target data in a target storage type.
Optionally, the data storage module includes:
the target storage type determining module is used for determining a target storage type adaptive to the processor from a plurality of preset storage types according to the structural configuration of the processor;
and the data storage submodule is used for storing the effective target data in a target storage type.
In the embodiment of the present invention, multiple storage types for storing valid target data are preset according to different structural configurations of a processor, and a target storage type adapted to the structural configuration of the processor is determined from the multiple preset storage types by a target storage type determining module, so as to store the valid target data.
Optionally, the diagnosis analysis module 203 and the early warning reporting module 204 include:
the first diagnosis analysis module is used for determining the functional module of the processor that corresponds to the valid target data by performing a first diagnosis analysis on the valid target data;
the second diagnosis analysis module is used for determining each register which is relevant to the functional module according to the determined functional module;
the third diagnosis analysis module is used for performing a second diagnosis analysis on the valid target data together with the valid target data in each register which is relevant to the functional module, to generate corresponding diagnosis information;
and the early warning reporting submodule is used for generating a corresponding early warning log based on the diagnosis information and feeding the early warning log back to the user.
Optionally, the system 200 further comprises:
and the first processing task allocation module is used for suspending the processing of the processing tasks by the processor or transferring the processing tasks of the processor to a normal unit for processing according to the diagnosis information.
Optionally, the system 200 further comprises:
the priority setting module is used for presetting the priority of the processing task;
and the second processing task allocation module is used for transferring the processing tasks with the first priority to the normal unit for processing when the diagnosis information indicates that the processor is about to have a fault, and suspending the processing of the processing tasks with the second priority by the processor or continuing the processing of the processing tasks with the second priority by the processor.
The embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 301, a communication interface 302, a memory 303 and a communication bus 304, where the processor 301, the communication interface 302 and the memory 303 complete mutual communication through the communication bus 304;
a memory 303 for storing a computer program;
the processor 301 is configured to implement the steps of the method for early warning of processor failure according to the present invention when executing the program stored in the memory 303.
The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the computer readable storage medium realizes the early warning method for the processor fault provided by the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the computer instructions cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the corresponding parts of the description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (11)

1. A processor fault early warning method is characterized by comprising the following steps:
performing periodic polling reading on a register in a register matrix according to a preset polling sequence to obtain target data stored in the register;
according to the validity judgment, determining the target data with validity in the register as valid target data and storing the valid target data;
and generating corresponding diagnosis information by diagnosing and analyzing the effective target data, and generating a corresponding early warning log based on the diagnosis information and feeding the early warning log back to a user.
2. The method of claim 1, wherein the register matrix comprises registers capable of performing fault early warning.
3. The method for early warning of processor faults according to claim 1, wherein in the process of periodically polling and reading the registers in the register matrix according to a preset polling sequence to obtain the target data stored in the registers, the method further comprises:
recording, in real time, the register being read by polling;
and after the processor is restarted due to a fault, continuously performing polling reading on each register in the register matrix from the register recorded when the fault occurs according to the sequence and periodicity of polling reading, and obtaining target data stored in each register.
4. The method for early warning of processor faults according to claim 1, wherein the step of determining and storing the target data with validity in the register as valid target data according to the validity judgment comprises the following steps:
determining a command return code for reading the register, and determining a value of a target byte bit in target data of the register;
under the condition that the values of the command return code and the target byte bit both meet set conditions, determining target data corresponding to the target byte bit in the register as valid target data with validity;
and storing the effective target data in a target storage type.
5. The method of claim 4, wherein the storing the valid target data in a target storage type comprises:
determining a target storage type adapted to a processor from a plurality of preset storage types according to the structural configuration of the processor;
and storing the effective target data in a target storage type.
6. The method for early warning of processor faults according to claim 1, wherein the generating corresponding diagnosis information by performing diagnosis and analysis on the effective target data and generating corresponding warning logs based on the diagnosis information to be fed back to a user comprises:
determining the functional module of the processor that corresponds to the valid target data by performing a first diagnosis analysis on the valid target data;
determining each register having a correlation with the functional module according to the determined functional module;
and performing a second diagnosis analysis on the valid target data together with the valid target data in each register which is relevant to the functional module to generate corresponding diagnosis information, and generating a corresponding early warning log based on the diagnosis information to feed back to a user.
7. The method of claim 1, wherein the method further comprises:
and according to the diagnosis information, suspending the processing of the processing tasks by the processor or transferring the processing tasks of the processor to a normal unit for processing.
8. The method of claim 7, wherein the method further comprises:
presetting the priority of a processing task;
and when the diagnosis information indicates that the processor is about to have a fault, transferring the processing tasks with the first priority to a normal unit for processing, and suspending the processing of the processing tasks with the second priority by the processor or continuing to process the processing tasks with the second priority by the processor.
9. A processor fault early warning system, the system comprising:
the register polling module is used for periodically polling and reading a register in a register matrix according to a preset polling sequence to obtain target data stored in the register;
the register monitoring module is used for determining the target data with validity in the register as valid target data according to validity judgment and storing the valid target data;
the diagnosis analysis module is used for carrying out diagnosis analysis on the effective target data to generate corresponding diagnosis information;
and the early warning reporting module is used for generating a corresponding early warning log based on the diagnosis information and feeding the log back to a user.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of a method for early warning of processor failure as claimed in any one of claims 1 to 8 when executing a program stored in a memory.
11. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, implements a method for early warning of processor failure as claimed in any one of claims 1 to 8.
CN202211181971.1A 2022-09-27 2022-09-27 Processor fault early warning method, system, electronic equipment and medium Pending CN115686890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211181971.1A CN115686890A (en) 2022-09-27 2022-09-27 Processor fault early warning method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211181971.1A CN115686890A (en) 2022-09-27 2022-09-27 Processor fault early warning method, system, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115686890A true CN115686890A (en) 2023-02-03

Family

ID=85062769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211181971.1A Pending CN115686890A (en) 2022-09-27 2022-09-27 Processor fault early warning method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115686890A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination