CN117973276A

CN117973276A - Chip fault recognition and processing method, device, computer equipment and storage medium

Info

Publication number: CN117973276A
Application number: CN202311833216.1A
Authority: CN
Inventors: 向柏澄; 习伟; 陈军健; 陶伟; 张巧惠; 关志华; 董飞龙; 谢心昊; 孙沁; 张泽林
Original assignee: Southern Power Grid Digital Grid Research Institute Co Ltd
Current assignee: Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-05-03

Abstract

The present application relates to a chip fault recognition and processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: obtaining structure parameter information corresponding to each component structure of the chip, data interaction mode information among the component structures and current state information of the chip; generating a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information among each component structure; operating the current state information through a digital twin model to obtain operation process simulation data corresponding to the chip in a target time range; determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data; and determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure. The method can improve the fault processing efficiency of the chip.

Description

Chip fault recognition and processing method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of integrated circuit chip technology, and in particular, to a chip fault recognition and processing method, apparatus, computer device, storage medium, and computer program product.

Background

As the volume and variety of data processed by integrated circuit chips increase, the amount of data a chip processes per unit time also increases.

When a chip stays in a high-efficiency operating state for a long time, some of its structures often fail, which disrupts the normal operation of the chip and in turn degrades its timeliness. At present, fault detection usually relies on monitoring the operating parameters of the chip in real time; when a fault occurs, the fault points of the chip are located manually and the problems at those fault points are repaired manually. However, manual fault identification and repair suffer from low identification accuracy and low processing efficiency.

Therefore, the conventional technology suffers from low fault processing efficiency for chips.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a chip fault recognition and processing method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the fault processing efficiency for a chip.

A method for chip fault identification and handling, comprising:

obtaining structure parameter information corresponding to each component structure of the chip, data interaction mode information among the component structures and current state information of the chip;

generating a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures;

running the current state information through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range;

determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data;

and determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

In one embodiment, generating a digital twin model for a chip according to structure parameter information corresponding to each component structure and data interaction mode information between each component structure includes:

Generating a structure model corresponding to each component structure according to the structure parameter information corresponding to each component structure;

Determining a data interaction strategy between each component structure and a connection mode between each component structure according to the data interaction mode information between each component structure;

and connecting the structure models corresponding to the component structures according to the data interaction strategy among the component structures and the connection mode among the component structures to obtain the digital twin model for the chip.

In one embodiment, running the current state information through the digital twin model to obtain the running process simulation data corresponding to the chip in the target time range includes:

running the current state information through the digital twin model to obtain sub-operation process simulation data produced by the chip in each sub-operation process in the target time range; each sub-operation process is obtained by dividing the duration of the operation process in time order; the sub-operation process simulation data produced by any one sub-operation process are used as the input for running the next sub-operation process;

and connecting the sub-operation process simulation data produced in each sub-operation process in time order to obtain the operation process simulation data corresponding to the chip in the target time range.

In one embodiment, determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data includes:

dividing the running process simulation data into structure running process simulation data corresponding to each component structure, and determining running data distribution information corresponding to each component structure according to the structure running process simulation data corresponding to each component structure;

Acquiring a normal operation data distribution range corresponding to each component structure in normal operation, and determining a component structure conforming to a fault structure judgment condition according to operation data distribution information and the normal operation data distribution range corresponding to each component structure;

And taking the component structure meeting the fault structure judging condition as a fault structure, and determining abnormal operation condition information of the fault structure according to the operation data distribution information corresponding to the fault structure as fault information corresponding to the fault structure.

In one embodiment, determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure includes:

Based on the fault information corresponding to each fault structure, identifying the fault reason corresponding to each fault structure;

And acquiring current state information corresponding to each fault structure, and determining a fault protection and control strategy for each fault structure according to the fault reasons and the current state information corresponding to each fault structure.

In one embodiment, obtaining current state information corresponding to each fault structure includes:

acquiring current structure parameter information corresponding to each fault structure, and determining current running state information and current structure state information of each fault structure based on the current structure parameter information corresponding to each fault structure;

And determining the current state information corresponding to each fault structure according to the current running state information and the current structure state information of each fault structure.

A chip fault recognition and processing device, comprising:

the acquisition module is used for acquiring structure parameter information corresponding to each component structure of the chip, data interaction mode information among each component structure and current state information of the chip;

the generating module is used for generating a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures;

The simulation module is used for running the current state information through the digital twin model to obtain the running process simulation data corresponding to the chip in the target time range;

the identification module is used for determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data;

The determining module is used for determining fault protection and control strategies aiming at the fault structures of the chip according to the fault information corresponding to the fault structures.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.

A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.

In the above chip fault recognition and processing method, apparatus, computer device, storage medium and computer program product, structure parameter information corresponding to each component structure of the chip, data interaction mode information among the component structures and current state information of the chip are acquired; a digital twin model for the chip is generated according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures; the current state information is run through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range; a fault structure and fault information corresponding to the fault structure are determined in each component structure of the chip based on the running process simulation data; and a fault protection and control strategy for each fault structure of the chip is determined according to the fault information corresponding to each fault structure. In this way, digital twin technology yields a model that can predict how the chip will operate within the target time range, and the resulting simulation data approximate the actual operation of the chip. Structures of the chip that are likely to fail can therefore be identified accurately, which improves the fault detection efficiency for the chip. A corresponding fault protection and control strategy is then generated for each structure that is likely to fail, which avoids inefficient handling at the time a fault occurs and improves the fault processing efficiency for the chip.

Drawings

FIG. 1 is an application environment diagram of a method for chip failure recognition and handling in one embodiment;

FIG. 2 is a flow chart of a method for identifying and handling chip failures in one embodiment;

FIG. 3 is a flow chart of a method for identifying and handling chip failures in another embodiment;

FIG. 4 is a block diagram of a chip failure recognition and processing device in one embodiment;

FIG. 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure.

The chip fault recognition and processing method provided by the embodiments of the application can be applied in application environments of Internet of Things chips and integrated circuit chips, and can also be applied to the terminal 102 or the server 104 in the application system shown in FIG. 1. In FIG. 1, the terminal 102 communicates with the server 104 via a network. A data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 acquires structure parameter information corresponding to each component structure of the chip, data interaction mode information among the component structures and current state information of the chip; the terminal 102 generates a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures; the terminal 102 runs the current state information through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range; the terminal 102 determines a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data; the terminal 102 determines a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices, where the Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a chip fault recognition and processing method is provided. The method is described as applied to the terminal 102 in FIG. 1 by way of illustration, and includes the following steps:

Step S202, obtaining structure parameter information corresponding to each component structure of the chip, data interaction mode information among each component structure and current state information of the chip.

The constituent structures may be respective structures constituting a chip, for example, a logic structure, a register structure, and a control structure.

The structure parameter information may be a structure parameter corresponding to each structure of the chip, for example, the structure parameter information may be a logic structure parameter, a register structure parameter, and a control structure parameter.

The data interaction mode information may be a data interaction mode between each structure of the chip.

The current state information may be operation data information corresponding to the chip in the current operation state.

In the specific implementation, the terminal acquires the corresponding structural parameters of each component structure of the chip, the data interaction mode among each component structure and the current state information of the chip.

Step S204, a digital twin model aiming at the chip is generated according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures.

The digital twin model can be a simulation model of a chip generated based on a digital twin strategy, and can also be used as a structural model of the chip for simulating the operation process of the chip so as to acquire operation process simulation data of the chip.

In the specific implementation, the terminal generates a digital twin model capable of simulating the internal structure of the chip according to the corresponding structural parameters of each component structure and the data interaction mode among the component structures.

Step S206, running current state information through the digital twin model to obtain corresponding running process simulation data of the chip in a target time range.

The running process simulation data may be simulation data of the chip in the running process after simulating the running process of the chip.

In the specific implementation, the terminal operates the current state information through the digital twin model to obtain the corresponding operation process simulation data of the chip in the target time range.

Step S208, determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data.

The fault structure may be a component structure that is predicted to possibly develop a fault or abnormality.

The fault information may be running process information for the period during which the predicted component structure is faulty or abnormal.

In a specific implementation, based on the running process simulation data, the terminal identifies, among the component structures of the chip, the structures that are likely to fail and takes them as fault structures. The terminal also identifies the running process information corresponding to the period in which each fault structure is faulty during operation and takes that information as the fault information.

Step S210, determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

The fault protection and control strategy may be a processing strategy for resolving the fault cause of the fault structure.

In a specific implementation, the terminal determines a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

In practical application, after the fault protection and control strategies for the fault structures of the chip are obtained, each strategy needs to be sent to a display port, so that service personnel can carry out the corresponding fault protection and control on the fault structure according to the strategy shown on the display port.
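The data flow through steps S202 to S210 can be sketched as follows. This is only an illustrative sketch: the function `recognise_and_handle`, its five callable parameters, and all toy values are assumptions introduced here and do not appear in the application.

```python
from typing import Any, Callable, Dict, Tuple


def recognise_and_handle(
    chip: Any,
    acquire: Callable[[Any], Tuple[dict, dict, dict]],
    build_twin: Callable[[dict, dict], Any],
    simulate: Callable[[Any, dict], Any],
    identify: Callable[[Any], Dict[str, dict]],
    decide: Callable[[dict], str],
) -> Dict[str, str]:
    """Chain steps S202 to S210; every callable is a hypothetical placeholder."""
    params, modes, state = acquire(chip)   # S202: parameters, interaction modes, state
    twin = build_twin(params, modes)       # S204: digital twin model
    sim_data = simulate(twin, state)       # S206: simulation data for the target range
    faults = identify(sim_data)            # S208: fault structures and fault information
    # S210: one strategy per fault structure
    return {name: decide(info) for name, info in faults.items()}


# Toy stand-ins so the sketch runs end to end (all values are invented).
strategies = recognise_and_handle(
    chip="demo-chip",
    acquire=lambda chip: ({"register": {"width": 32}}, {}, {"register": 1.0}),
    build_twin=lambda params, modes: {"params": params, "modes": modes},
    simulate=lambda twin, state: {"register": [1.0, 9.0, 9.5]},
    identify=lambda sim: {k: {"peak": max(v)} for k, v in sim.items() if max(v) > 5},
    decide=lambda info: "throttle" if info["peak"] > 9 else "monitor",
)
```

Passing each step as a callable keeps the sketch independent of how any individual step is actually realised.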

In the chip fault recognition and processing method described above, structure parameter information corresponding to each component structure of the chip, data interaction mode information among the component structures and current state information of the chip are acquired; a digital twin model for the chip is generated according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures; the current state information is run through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range; a fault structure and fault information corresponding to the fault structure are determined in each component structure of the chip based on the running process simulation data; and a fault protection and control strategy for each fault structure of the chip is determined according to the fault information corresponding to each fault structure. In this way, digital twin technology yields a model that can predict how the chip will operate within the target time range, and the resulting simulation data approximate the actual operation of the chip. Structures of the chip that are likely to fail can therefore be identified accurately, which improves the fault detection efficiency for the chip. A corresponding fault protection and control strategy is then generated for each structure that is likely to fail, which avoids inefficient handling at the time a fault occurs and improves the fault processing efficiency for the chip.

In another embodiment, generating a digital twin model for a chip according to structure parameter information corresponding to each component structure and data interaction mode information between each component structure includes: generating a structure model corresponding to each component structure according to the structure parameter information corresponding to each component structure; determining a data interaction strategy between each component structure and a connection mode between each component structure according to the data interaction mode information between each component structure; and connecting the structure models corresponding to the component structures according to the data interaction strategy among the component structures and the connection mode among the component structures to obtain the digital twin model for the chip.

The structure model may be a model that simulates any one component structure of the chip.

The data interaction policy may include a data transmission mode and a data transmission policy.

The connection mode can represent the connection relation among all the component structures in the chip.

In a specific implementation, the terminal first generates a structure model for each component structure according to the structure parameters of that component structure. The terminal then determines the data interaction strategy and the connection mode between the component structures according to the data interaction mode between them. Next, the terminal connects the structure models according to the connection mode between the component structures to obtain an initial operation model of the chip. Finally, the terminal adds the data interaction strategy between the component structures to the initial operation model to obtain the digital twin model of the chip.
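The construction described in this embodiment can be illustrated with a minimal sketch. The class names `StructureModel` and `DigitalTwin`, the function `build_digital_twin`, and the representation of interaction mode information as a mapping from a pair of structure names to a mode string are all assumptions made for illustration; the application does not prescribe any data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (source structure, destination structure)


@dataclass
class StructureModel:
    """Model of one component structure, built from its structure parameters."""
    name: str
    params: Dict[str, float]


@dataclass
class DigitalTwin:
    """Structure models plus the connections and interaction strategies between them."""
    models: Dict[str, StructureModel]
    connections: List[Link] = field(default_factory=list)
    policies: Dict[Link, str] = field(default_factory=dict)


def build_digital_twin(
    structure_params: Dict[str, Dict[str, float]],
    interaction_modes: Dict[Link, str],
) -> DigitalTwin:
    # 1. One structure model per component structure.
    models = {n: StructureModel(n, dict(p)) for n, p in structure_params.items()}
    # 2. Derive the connection mode and interaction strategy from the mode information.
    connections: List[Link] = []
    policies: Dict[Link, str] = {}
    for (src, dst), mode in interaction_modes.items():
        if src not in models or dst not in models:
            raise KeyError(f"unknown component structure in link {src}->{dst}")
        connections.append((src, dst))
        policies[(src, dst)] = mode
    # 3. Connect the models (initial operation model), then 4. add the strategies.
    twin = DigitalTwin(models=models, connections=connections)
    twin.policies.update(policies)
    return twin


# Hypothetical example: two structures joined by one link.
twin = build_digital_twin(
    {"logic": {"gates": 1000.0}, "register": {"width": 32.0}},
    {("logic", "register"): "synchronous_bus"},
)
```

Separating the connection list from the strategy mapping mirrors the two-stage construction in the text: an initial operation model is connected first, and the data interaction strategy is added afterwards.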

According to the technical scheme of this embodiment, a structure model corresponding to each component structure is generated according to the structure parameter information corresponding to each component structure; a data interaction strategy between the component structures and a connection mode between the component structures are determined according to the data interaction mode information between the component structures; and the structure models corresponding to the component structures are connected according to the data interaction strategy and the connection mode to obtain the digital twin model for the chip. In this way, by building a sub operation model for each component structure and identifying the connection mode and data interaction strategy of the component structures, a digital twin model that can simulate the operation of the chip is constructed, and the fidelity of the operation model of the chip is improved.

In another embodiment, running the current state information through the digital twin model to obtain the running process simulation data corresponding to the chip in the target time range includes: running the current state information through the digital twin model to obtain sub-operation process simulation data produced by the chip in each sub-operation process in the target time range; each sub-operation process is obtained by dividing the duration of the operation process in time order; the sub-operation process simulation data produced by any one sub-operation process are used as the input for running the next sub-operation process; and connecting the sub-operation process simulation data produced in each sub-operation process in time order to obtain the operation process simulation data corresponding to the chip in the target time range.

The sub-running process may be a running process of a unit time period when the chip executes the data processing task.

The simulation data of the sub-running process can be the simulation data of a chip obtained by running the sub-running process. That is, the sub-run process simulation data may be simulation data of a run process per unit time period when the chip performs a data processing task.

In a specific implementation, the terminal runs the current state information of the chip through the digital twin model of the chip to obtain first sub-operation process simulation data, that is, the data produced by the first sub-operation process of the chip for the current state information. The terminal then runs the first sub-operation process simulation data through the digital twin model as the operation data of the second sub-operation process to obtain second sub-operation process simulation data, and runs the second sub-operation process simulation data as the operation data of the third sub-operation process to obtain third sub-operation process simulation data. This is repeated until sub-operation process simulation data have been obtained for all sub-operation processes of the chip. Finally, the terminal connects all the sub-operation process simulation data in time order to obtain the operation process simulation data corresponding to the chip in the target time range. The sub-operation processes are obtained by dividing the duration of the operation process in time order.

In practical application, the duration between the ending time point of the last sub-running process and the starting time point of the running process is equal to the duration corresponding to the target time range.
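The chained simulation of sub-operation processes can be sketched as a simple loop in which the output of each sub-operation process becomes the input of the next. The function `simulate_target_range`, the `State` type, and the toy dynamics in `toy_step` are hypothetical; in particular, `toy_step` merely stands in for one sub-operation process run by the digital twin model.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


def simulate_target_range(
    step: Callable[[State], State],
    current_state: State,
    num_sub_processes: int,
) -> List[State]:
    """Run the sub-operation processes in time order, feeding each output into the next."""
    if num_sub_processes < 1:
        raise ValueError("at least one sub-operation process is required")
    results: List[State] = []
    state = dict(current_state)
    for _ in range(num_sub_processes):
        state = step(state)          # output of this sub-process is the next input
        results.append(dict(state))  # appended in time order
    return results


def toy_step(state: State) -> State:
    # Invented dynamics for illustration: temperature drifts upward with load.
    return {"temp": state["temp"] + 0.5 * state["load"], "load": state["load"]}


trace = simulate_target_range(toy_step, {"temp": 40.0, "load": 2.0}, 3)
```

The number of sub-operation processes multiplied by the unit time period gives the length of the target time range, consistent with the duration relationship stated above.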

According to the technical scheme of this embodiment, the current state information is run through the digital twin model to obtain sub-operation process simulation data produced by the chip in each sub-operation process in the target time range; each sub-operation process is obtained by dividing the duration of the operation process in time order; the sub-operation process simulation data produced by any one sub-operation process are used as the input for running the next sub-operation process; and the sub-operation process simulation data produced in each sub-operation process are connected in time order to obtain the operation process simulation data corresponding to the chip in the target time range. In this way, simulation data for each sub-operation process in the target time range are obtained through a progressive, step-by-step prediction method, which improves the accuracy of the operation process simulation data obtained for the chip.

In another embodiment, determining a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data includes: dividing the running process simulation data into structure running process simulation data corresponding to each component structure, and determining running data distribution information corresponding to each component structure according to the structure running process simulation data corresponding to each component structure; acquiring a normal operation data distribution range corresponding to each component structure in normal operation, and determining a component structure conforming to a fault structure judgment condition according to operation data distribution information and the normal operation data distribution range corresponding to each component structure; and taking the component structure meeting the fault structure judging condition as a fault structure, and determining abnormal operation condition information of the fault structure according to the operation data distribution information corresponding to the fault structure as fault information corresponding to the fault structure.

The structure running process simulation data may refer to simulation data corresponding to any component structure of the chip in the chip running process.

The operation data distribution information may refer to how the operation data are distributed in a data distribution coordinate system with time as the abscissa and the operation data as the ordinate.

The normal operation data distribution range may refer to an operation data distribution range corresponding to the composition structure in a normal operation state.

The failure structure determination condition may be a determination condition for determining whether or not there is a failure in the constituent structure of the chip.

The abnormal operation condition information may refer to operation data distribution information which corresponds to the composition structure of the chip and does not belong to the normal operation data distribution range during abnormal operation.

In a specific implementation, the terminal divides the operation process simulation data into structure operation process simulation data corresponding to each component structure and, based on these data, identifies the operation data distribution information of each component structure in the data distribution coordinate system. The terminal then acquires the normal operation data distribution range corresponding to each component structure in normal operation. For any component structure, the terminal determines the proportion of its operation data that falls outside the normal operation data distribution range relative to its total operation data, and takes each component structure whose proportion is greater than a preset proportion threshold as a fault structure. For any fault structure, the terminal determines, based on the operation data distribution information and the normal operation data distribution range corresponding to that fault structure, the abnormal operation condition information within its operation data distribution information and takes it as the fault information corresponding to the fault structure.

In practical application, for any fault structure, the terminal identifies the starting point and duration of the abnormal condition in the abnormal operation condition information, and takes the starting point and duration and the abnormal operation condition information as the fault information of the fault structure.
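The identification logic described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the names `FaultInfo` and `identify_fault_structures`, the representation of operation data as (time, value) pairs, the use of a simple lower and upper bound as the normal operation data distribution range, and the default threshold value are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (time, operation data value); assumed layout


@dataclass
class FaultInfo:
    """Hypothetical container for the fault information of one fault structure."""
    abnormal_samples: List[Sample]  # abnormal operation condition information
    start_time: float               # starting point of the abnormal condition
    duration: float                 # duration of the abnormal condition


def identify_fault_structures(
    simulation: Dict[str, List[Sample]],
    normal_ranges: Dict[str, Tuple[float, float]],
    ratio_threshold: float = 0.2,   # assumed preset proportion threshold
) -> Dict[str, FaultInfo]:
    """Return fault information for every structure whose share of
    out-of-range operation data exceeds the proportion threshold."""
    faults: Dict[str, FaultInfo] = {}
    for name, samples in simulation.items():
        if not samples:
            continue  # no simulated data, nothing to judge
        low, high = normal_ranges[name]
        abnormal = [(t, v) for t, v in samples if not (low <= v <= high)]
        ratio = len(abnormal) / len(samples)
        if ratio > ratio_threshold:
            times = [t for t, _ in abnormal]
            faults[name] = FaultInfo(
                abnormal_samples=abnormal,
                start_time=min(times),
                duration=max(times) - min(times),
            )
    return faults
```

In this sketch the duration is simply the span between the first and last abnormal samples; an implementation that needs to distinguish several separate abnormal intervals would have to segment the samples first.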

According to the technical scheme of the embodiment, the running process simulation data are divided into structure running process simulation data corresponding to each component structure, and running data distribution information corresponding to each component structure is determined according to the structure running process simulation data corresponding to each component structure; acquiring a normal operation data distribution range corresponding to each component structure in normal operation, and determining a component structure conforming to a fault structure judgment condition according to operation data distribution information and the normal operation data distribution range corresponding to each component structure; taking a component structure meeting the fault structure judging conditions as a fault structure, and determining abnormal operation condition information of the fault structure according to operation data distribution information corresponding to the fault structure as fault information corresponding to the fault structure; therefore, whether the distribution range of the operation data corresponding to each component structure is abnormal or not can be accurately identified, so that the fault structure and the corresponding fault information are identified, and the identification efficiency and the identification accuracy of the fault structure and the corresponding fault information are improved.

In another embodiment, determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure includes: identifying the fault reason corresponding to each fault structure based on the fault information corresponding to each fault structure; and acquiring current state information corresponding to each fault structure, and determining a fault protection and control strategy for each fault structure according to the fault reason and the current state information corresponding to each fault structure.

The fault reason may be a reason that causes the fault structure to fail, for example, the fault reason may refer to hardware aging, high hardware temperature, long-time high-frequency operation, incorrect operation data, incorrect data reception, and the like.

The current state information may be a program running state in a running process of the fault structure.

In the specific implementation, for each fault structure, the terminal identifies the fault reason of the fault structure according to the fault information corresponding to the fault structure, then the terminal acquires the current state information corresponding to each fault structure, and determines the fault protection and control strategy corresponding to the fault structure according to the fault reason and the current state information corresponding to each fault structure.

In practical application, for each fault structure, the terminal determines the fault type corresponding to the fault reason of the fault structure according to the fault information corresponding to the fault structure, and determines the starting time point and the duration of the abnormality of the fault structure according to that fault information, thereby determining the adjustment time limit corresponding to the fault structure. Then, when the fault type corresponding to the fault structure is determined to be a structural fault, the terminal identifies the fault point corresponding to the fault structure based on the fault information corresponding to the fault structure, queries the database for the fault processing strategy corresponding to the fault point based on the fault reason corresponding to the fault structure, and takes the fault point, the fault processing strategy and the adjustment time limit corresponding to the fault structure as the fault protection and control strategy of the fault structure. Alternatively, when the fault type corresponding to the fault structure is determined to be an operation fault, the terminal queries the database for the operation adjustment strategy corresponding to the operation fault based on the fault reason corresponding to the fault structure, and takes the operation adjustment strategy and the adjustment time limit as the fault protection and control strategy of the fault structure. By selecting the strategy and the time limit according to the fault type, a suitable fault protection and control strategy can be determined for each fault structure.

The operation adjustment strategy is a processing strategy for resolving an operation fault. It is a handling mode for different operation faults that is preset in the terminal and is obtained by summarizing information from multiple dimensions, such as a large amount of working experience and expert experience.
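The branching between structural faults and operation faults can be sketched as follows. This is only an illustrative sketch under stated assumptions: the lookup tables stand in for the database mentioned above and their contents are invented, the assignment of fault reasons to fault types is hypothetical, and the adjustment time limit is assumed to be the time remaining until the predicted start of the abnormality (the text does not specify how the time limit is derived from the starting time point and duration).

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical stand-ins for the database tables; all entries are invented.
FAULT_TYPE_BY_REASON: Dict[str, str] = {
    "hardware aging": "structural",
    "high hardware temperature": "structural",
    "incorrect operation data": "operation",
    "incorrect data reception": "operation",
}
FAULT_PROCESSING_STRATEGY: Dict[str, str] = {
    "hardware aging": "schedule replacement of the fault point",
    "high hardware temperature": "increase cooling at the fault point",
}
OPERATION_ADJUSTMENT_STRATEGY: Dict[str, str] = {
    "incorrect operation data": "recompute from the last valid state",
    "incorrect data reception": "request retransmission",
}


@dataclass
class ControlStrategy:
    """Hypothetical record of a fault protection and control strategy."""
    action: str
    adjustment_time_limit: float
    fault_point: Optional[str] = None  # only set for structural faults


def determine_strategy(reason: str, start_time: float, now: float,
                       fault_point: Optional[str] = None) -> ControlStrategy:
    """Pick a strategy by fault type and attach the adjustment time limit."""
    # Assumption: the limit is the time left before the predicted anomaly.
    limit = max(0.0, start_time - now)
    if FAULT_TYPE_BY_REASON[reason] == "structural":
        if fault_point is None:
            raise ValueError("a structural fault requires a fault point")
        return ControlStrategy(FAULT_PROCESSING_STRATEGY[reason], limit,
                               fault_point)
    return ControlStrategy(OPERATION_ADJUSTMENT_STRATEGY[reason], limit)
```

For a structural fault the returned record carries the fault point, the fault processing strategy and the time limit; for an operation fault it carries only the operation adjustment strategy and the time limit, mirroring the two branches described above.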

According to the technical scheme of this embodiment, the fault reason corresponding to each fault structure is identified based on the fault information corresponding to each fault structure; current state information corresponding to each fault structure is acquired, and a fault protection and control strategy for each fault structure is determined according to the fault reason and the current state information corresponding to each fault structure. In this way, a fault protection and control strategy can be determined for each fault structure according to the fault reasons and the current state information of the different fault structures, so that component structures of the chip that may fail can be comprehensively protected and controlled in advance.

In another embodiment, obtaining current state information corresponding to each fault structure includes: acquiring current structure parameter information corresponding to each fault structure, and determining current running state information and current structure state information of each fault structure based on the current structure parameter information corresponding to each fault structure; and determining the current state information corresponding to each fault structure according to the current running state information and the current structure state information of each fault structure.

The current structure parameter information may refer to an operation state parameter and a structure state parameter corresponding to the fault structure in the current operation.

The current running state information may refer to a current running state of the program of the fault structure.

The current structure state information may refer to a current state of structural hardware of the fault structure.

In a specific implementation, the terminal collects the current structure parameter information corresponding to each fault structure, determines the current running state and the current structure state of each fault structure based on the current structure parameter information, and takes the running state and the structure state of each fault structure as the current state information of that fault structure.

According to the technical scheme of the embodiment, the current running state information and the current structure state information of each fault structure are determined based on the current structure parameter information corresponding to each fault structure by acquiring the current structure parameter information corresponding to each fault structure; determining the current state information corresponding to each fault structure according to the current running state information and the current structure state information of each fault structure; therefore, the current state of the fault structure can be determined by determining the running state and the structure state of the fault structure, and the accuracy of determining the current state of the fault structure is improved.

To facilitate understanding by those skilled in the art, the following further provides, by way of example, a specific method for the above step of identifying the fault cause of a fault structure based on the fault information corresponding to the fault structure, including:

Step 1: for any composition structure, inquiring the distribution state of the operation data corresponding to each fault cause appearing in the composition structure history through a database;

Step 2: for any fault structure, identifying an initial fault reason corresponding to the fault structure based on operation data distribution information corresponding to each fault reason appearing in each component structure history and abnormal operation data distribution information corresponding to the fault structure;

Step 3: for any fault structure, determining the abnormal occurrence time length and the abnormal occurrence starting time point corresponding to the fault structure according to the abnormal operation data distribution information corresponding to the fault structure;

Step 4: for any fault structure, identifying the related fault structures that fail at the same time as the fault structure, starting from the starting time point of the abnormal occurrence;

Step 5: for any fault structure, acquiring abnormal operation data distribution information corresponding to the related fault structures of the fault structure, determining an association relation between the fault structure and the abnormal operation data distribution information of each related fault structure, and further determining the fault cause corresponding to the fault structure based on the initial fault cause corresponding to the fault structure, the initial fault cause corresponding to each related fault structure and the association relation between the fault structure and the abnormal operation data distribution information of each related fault structure.

Each of the above-described initial failure causes corresponds to one type of operation data distribution information, and the same type of operation data distribution information may correspond to one or more initial failure causes, and thus:

When only one initial fault reason is identified according to the abnormal data distribution information, the initial fault reason is used as the fault reason of the fault structure;

Under the condition that a plurality of initial fault reasons are identified according to the abnormal data distribution information, inquiring each related fault structure which is the same as the abnormal occurrence time length and the abnormal occurrence starting time point of the fault structure based on the abnormal occurrence time length and the abnormal occurrence starting time point in the abnormal operation data distribution information corresponding to the fault structure; and determining the fault reason corresponding to the fault structure based on the initial fault reason corresponding to the fault structure, the initial fault reason corresponding to each related fault structure and the association relation between the fault structure and the abnormal operation data distribution information of each related fault structure.

By the method for determining the fault cause of any fault structure, the initial fault cause between the fault structure and each related fault structure corresponding to the fault structure can be combined, the fault cause of any fault structure can be comprehensively analyzed, and more accurate fault cause can be determined for any fault structure.
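The lookup of initial fault causes and of related fault structures can be sketched as follows. This is a minimal sketch under assumptions that are not in the original text: each type of operation data distribution information is reduced to a plain label string, the historical database is modelled as a dictionary from fault cause to such a label, and "the same" starting time point and duration is taken to mean exact equality.

```python
from typing import Dict, List, Tuple


def initial_fault_causes(history: Dict[str, str],
                         abnormal_pattern: str) -> List[str]:
    """Return every historical fault cause whose distribution type matches
    the abnormal distribution observed for the fault structure.

    `history` maps a fault cause to a distribution type label (assumed)."""
    return sorted(cause for cause, pattern in history.items()
                  if pattern == abnormal_pattern)


def related_fault_structures(timing: Dict[str, Tuple[float, float]],
                             name: str) -> List[str]:
    """Return the other fault structures whose abnormality has the same
    starting time point and the same duration as that of `name`.

    `timing` maps a fault structure to (start time, duration)."""
    start, duration = timing[name]
    return sorted(other for other, (s, d) in timing.items()
                  if other != name and s == start and d == duration)


def needs_related_analysis(causes: List[str]) -> bool:
    """A single initial cause is used directly as the fault cause; several
    candidates require the analysis of related fault structures."""
    return len(causes) > 1
```

When `needs_related_analysis` returns `False`, the single initial fault cause is taken as the fault cause; otherwise the related fault structures returned by `related_fault_structures` feed the association analysis.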

The following further exemplarily provides a method for identifying an association relationship between the fault structure and abnormal operation data distribution information of each related fault structure, including:

Step 1: for any fault structure, determining a data interaction strategy between the fault structure and each related fault structure based on a data interaction mode between the fault structure and each related fault structure;

Step 2: based on the data interaction strategy between the fault structure and each related fault structure, identifying interaction association information between the abnormal operation data distribution information corresponding to the fault structure and the abnormal operation data distribution information corresponding to each related fault structure; the interaction association information characterizes the degree of association of the data interaction between two fault structures;

Step 3: calculating the distribution trend corresponding to the operation data distribution information of the fault structure and each related fault structure in the same time length based on the operation data distribution information of the fault structure and each related fault structure in the same time length;

Step 4: based on the distribution trends corresponding to the operation data distribution information of the fault structure and of each related fault structure in the same time length, calculating the similarity between the operation data distribution information of the fault structure and that of each related fault structure, and using the similarity as similar association information between the abnormal operation data distribution information corresponding to the fault structure and the abnormal operation data distribution information corresponding to each related fault structure; the similar association information characterizes the similarity of the abnormal operation data distribution trends of two fault structures;

Step 5: determining the association degree between the fault structure and each related fault structure based on the interaction association information and the similar association information between the abnormal operation data distribution information corresponding to the fault structure and the abnormal operation data distribution information corresponding to each related fault structure.

By determining the interactive association information and the similar association information between any fault structure and each related fault structure and determining the association degree between any fault structure and each related fault structure, the association relation between each fault structure can be comprehensively identified.
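One possible way to compute the quantities in Steps 3 to 5 is sketched below. The original text does not specify the formulas, so every choice here is an assumption made for illustration: the distribution trend is taken as the sequence of first differences, the similarity is the cosine similarity of two trends, the interaction association information is assumed to be already available as a number between 0 and 1, and the association degree is a weighted sum with a hypothetical weight.

```python
import math
from typing import List


def trend(values: List[float]) -> List[float]:
    """Distribution trend, assumed here to be the first differences of the
    operation data sampled over the same time length."""
    return [b - a for a, b in zip(values, values[1:])]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equally long trends; 0.0 if either is flat."""
    if len(u) != len(v):
        raise ValueError("trends must cover the same time length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def association_degree(interaction: float, similarity: float,
                       weight: float = 0.5) -> float:
    """Combine interaction association information and similar association
    information; the weighted sum and the weight are assumptions."""
    return weight * interaction + (1.0 - weight) * similarity
```

A weighted sum keeps both kinds of association information visible in the result; other combinations, such as a product, would also fit the description.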

The following further exemplarily provides a specific method for determining the fault cause corresponding to the fault structure based on the initial fault cause corresponding to the fault structure, the initial fault cause corresponding to each related fault structure, and the association relationship between the fault structure and the abnormal operation data distribution information of each related fault structure, which includes:

Step 1: for any fault structure, determining the similarity between the initial fault cause of the fault structure and the initial fault cause of each related fault structure based on the association relation between the fault structure and the abnormal operation data distribution information of each related fault structure;

Step 2: for any fault structure, averaging the similarities between the initial fault cause of the fault structure and the initial fault causes of the related fault structures to obtain an average similarity;

Step 3: taking the average similarity as a screening threshold, determining, for each related fault structure, whether the initial fault cause similarity between the fault structure and that related fault structure is larger than the average similarity, and taking the initial fault cause corresponding to each initial fault cause similarity larger than the average similarity as a sub-fault cause of the fault structure;

Step 4: taking each sub-fault cause as a fault cause of the fault structure.

By identifying the association relationship of the abnormal operation data distribution information between the two fault structures, the similarity between the fault structure and each related fault structure can be accurately determined.
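The screening by average similarity can be sketched as follows. This is a minimal sketch under one stated interpretation: the initial fault cause "corresponding to" a similarity is taken to be the initial fault cause of the related fault structure that produced that similarity. The function name and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def select_fault_causes(
    related: Dict[str, Tuple[float, str]],
) -> List[str]:
    """Screen initial fault causes using the average similarity as threshold.

    `related` maps each related fault structure to a pair of
    (initial fault cause similarity, initial fault cause)."""
    if not related:
        return []
    # Step 2: average similarity over all related fault structures.
    average = sum(sim for sim, _ in related.values()) / len(related)
    # Step 3: keep causes whose similarity is strictly above the average.
    selected: List[str] = []
    for sim, cause in related.values():
        if sim > average and cause not in selected:
            selected.append(cause)
    # Step 4: each sub-fault cause is a fault cause of the fault structure.
    return selected
```

Because the comparison is strict, the result is empty when all similarities are equal; the text does not say how that case is handled, so the sketch leaves it to the caller.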

In another embodiment, as shown in fig. 3, a method for identifying and processing a chip fault is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:

Step S302, obtaining structure parameter information corresponding to each component structure of the chip, data interaction mode information among each component structure and current state information of the chip.

Step S304, generating a structure model corresponding to each component structure according to the structure parameter information corresponding to each component structure.

Step S306, according to the data interaction mode information among the component structures, determining the data interaction strategy among the component structures and the connection mode among the component structures.

Step S308, according to the data interaction strategy among the component structures and the connection mode among the component structures, the structure models corresponding to the component structures are connected, and a digital twin model for the chip is obtained.

Step S310, running current state information through the digital twin model to obtain corresponding running process simulation data of the chip in a target time range.

Step S312, based on the running process simulation data, a fault structure and fault information corresponding to the fault structure are determined in each component structure of the chip.

Step S314, determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the chip fault recognition and processing method described above.
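Steps S304 to S308 can be illustrated with a minimal sketch in which the digital twin model is a set of structure models joined by directed links. The class names, the dictionary-based inputs and the representation of a data interaction strategy as a string are assumptions made for the example; the application does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class StructureModel:
    """Hypothetical structure model of one component structure (S304)."""
    name: str
    parameters: Dict[str, float]


@dataclass
class DigitalTwin:
    """Hypothetical digital twin: structure models plus their connections."""
    models: Dict[str, StructureModel] = field(default_factory=dict)
    # (source, target) -> data interaction strategy on that connection
    links: Dict[Tuple[str, str], str] = field(default_factory=dict)


def build_twin(structure_params: Dict[str, Dict[str, float]],
               interactions: Dict[Tuple[str, str], str]) -> DigitalTwin:
    """Generate structure models, then connect them (S304 to S308)."""
    twin = DigitalTwin()
    # S304: one structure model per component structure.
    for name, params in structure_params.items():
        twin.models[name] = StructureModel(name, dict(params))
    # S306/S308: connect models according to the interaction information.
    for (source, target), strategy in interactions.items():
        if source not in twin.models or target not in twin.models:
            raise KeyError(f"unknown component structure in {source}->{target}")
        twin.links[(source, target)] = strategy
    return twin
```

Rejecting links that name an unknown component structure keeps the connected model consistent with the structure parameter information acquired in step S302.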

It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a chip fault recognition and processing device for realizing the above related chip fault recognition and processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the device for identifying and processing a chip fault provided below may refer to the limitation of the method for identifying and processing a chip fault hereinabove, and will not be repeated herein.

In one embodiment, as shown in fig. 4, there is provided a chip failure recognition and processing apparatus, including:

The acquiring module 402 is configured to acquire structural parameter information corresponding to each component structure of the chip, data interaction mode information between each component structure, and current state information of the chip;

The generating module 404 is configured to generate a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information between each component structure;

The simulation module 406 is configured to obtain running process simulation data corresponding to the chip in the target time range by running the current state information through the digital twin model;

The identifying module 408 is configured to determine a fault structure and fault information corresponding to the fault structure in each component structure of the chip based on the running process simulation data;

The determining module 410 is configured to determine a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

In one embodiment, the generating module 404 is specifically configured to generate a structural model corresponding to each component structure according to the structural parameter information corresponding to each component structure; determining a data interaction strategy between each component structure and a connection mode between each component structure according to the data interaction mode information between each component structure; and connecting the structure models corresponding to the component structures according to the data interaction strategy among the component structures and the connection mode among the component structures to obtain the digital twin model for the chip.

In one embodiment, the simulation module 406 is specifically configured to operate the current state data through a digital twin model, so as to obtain sub-operation process simulation data obtained by running the chip in each sub-operation process within the target time range; each sub-operation process is obtained by dividing the time length of the operation process according to the time sequence; sub-operation process simulation data obtained by operation of any sub-operation process in all sub-operation processes are used for operation of the next sub-operation process of any sub-operation process; and connecting the sub-operation process simulation data obtained by operation in each sub-operation process according to the time sequence to obtain the operation process simulation data corresponding to the chip in the target time range.
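The chained sub-operation simulation handled by the simulation module 406 can be sketched as follows. This is a minimal sketch, assuming that time is counted in integer steps and that the digital twin model is exposed as a hypothetical callable `twin_step(state, start, span)` that returns the state at the end of the sub-operation process together with the simulation data it produced.

```python
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")
Record = Tuple[int, float]  # (time step, simulated operation data); assumed

# Hypothetical interface of the digital twin model for one sub-operation
# process: (state, start time, length) -> (end state, simulation data).
TwinStep = Callable[[State, int, int], Tuple[State, List[Record]]]


def simulate(twin_step: TwinStep, current_state: State,
             target_duration: int, sub_duration: int) -> List[Record]:
    """Run the twin over the target time range in consecutive sub-processes.

    The end state of each sub-operation process is the input of the next,
    and the per-process data are concatenated in time order."""
    if sub_duration <= 0:
        raise ValueError("sub_duration must be positive")
    records: List[Record] = []
    state = current_state
    elapsed = 0
    while elapsed < target_duration:
        span = min(sub_duration, target_duration - elapsed)
        state, data = twin_step(state, elapsed, span)
        records.extend(data)  # connect the data according to time sequence
        elapsed += span
    return records
```

Passing the end state forward is what makes each sub-operation process depend on the previous one; the final, possibly shorter, sub-process covers whatever remains of the target time range.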

In one embodiment, the identification module 408 is specifically configured to divide the running process simulation data into structure running process simulation data corresponding to each component structure, and determine running data distribution information corresponding to each component structure according to the structure running process simulation data corresponding to each component structure; acquiring a normal operation data distribution range corresponding to each component structure in normal operation, and determining a component structure conforming to a fault structure judgment condition according to operation data distribution information and the normal operation data distribution range corresponding to each component structure; and taking the component structure meeting the fault structure judging condition as a fault structure, and determining abnormal operation condition information of the fault structure according to the operation data distribution information corresponding to the fault structure as fault information corresponding to the fault structure.

In one embodiment, the determining module 410 is specifically configured to identify a fault cause corresponding to each fault structure based on fault information corresponding to each fault structure; and acquiring current state information corresponding to each fault structure, and determining a fault protection and control strategy for each fault structure according to the fault reasons and the current state information corresponding to each fault structure.

In one embodiment, the determining module 410 is specifically configured to obtain current structure parameter information corresponding to each fault structure, and determine current running state information and current structure state information of each fault structure based on the current structure parameter information corresponding to each fault structure; and determining the current state information corresponding to each fault structure according to the current running state information and the current structure state information of each fault structure.

Each of the above modules in the chip fault recognition and processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or be independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing chip fault identification and processing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for chip fault identification and handling.

It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of a chip fault identification and handling method as described above. The steps of a chip failure recognition and processing method herein may be the steps of a chip failure recognition and processing method of the above-described respective embodiments.

In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of a chip fault identification and handling method as described above. The steps of a chip failure recognition and processing method herein may be the steps of a chip failure recognition and processing method of the above-described respective embodiments.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, causes the processor to perform the steps of a chip fault identification and handling method as described above. The steps of a chip failure recognition and processing method herein may be the steps of a chip failure recognition and processing method of the above-described respective embodiments.

The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims

1. A method for chip fault identification and handling, the method comprising:

obtaining structure parameter information corresponding to each component structure of a chip, data interaction mode information among the component structures and current state information of the chip;

generating a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information between each component structure;

operating the current state information through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range;

2. The method according to claim 1, wherein generating the digital twin model for the chip according to the structure parameter information corresponding to each of the constituent structures and the data interaction manner information between each of the constituent structures includes:

determining a data interaction strategy between the constituent structures and a connection mode between the constituent structures according to the data interaction mode information between the constituent structures;

and connecting the structure models corresponding to the component structures according to the data interaction strategy among the component structures and the connection mode among the component structures to obtain a digital twin model for the chip.

3. The method according to claim 1, wherein operating the current state information through the digital twin model to obtain the operation process simulation data corresponding to the chip in the target time range comprises:

operating the current state information through the digital twin model to obtain sub-operation process simulation data produced by the chip in each sub-operation process within the target time range; wherein the sub-operation processes are obtained by dividing the duration of the operation process in time order, and the sub-operation process simulation data obtained from any one of the sub-operation processes is used for the operation of the next sub-operation process;

and connecting the sub-operation process simulation data obtained in each sub-operation process in time order to obtain the operation process simulation data corresponding to the chip in the target time range.

4. The method according to claim 1, wherein determining the fault structure among the component structures of the chip and the fault information corresponding to the fault structure based on the operation process simulation data comprises:

and taking each component structure that meets the fault structure judging condition as a fault structure, and determining abnormal operation condition information of the fault structure, according to the operation data distribution information corresponding to the fault structure, as the fault information corresponding to the fault structure.

5. The method according to claim 1, wherein determining the fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure comprises:

identifying a fault cause corresponding to each fault structure based on the fault information corresponding to each fault structure;

and acquiring current state information corresponding to each fault structure, and determining the fault protection and control strategy for each fault structure according to the fault cause and the current state information corresponding to each fault structure.

6. The method according to claim 5, wherein acquiring the current state information corresponding to each fault structure comprises:

7. A chip fault recognition and processing device, the device comprising:

an acquisition module, used for acquiring structure parameter information corresponding to each component structure of a chip, data interaction mode information among the component structures and current state information of the chip;

a generation module, used for generating a digital twin model for the chip according to the structure parameter information corresponding to each component structure and the data interaction mode information among the component structures;

the simulation module is used for operating the current state information through the digital twin model to obtain operation process simulation data corresponding to the chip in a target time range;

and a determining module, used for determining a fault protection and control strategy for each fault structure of the chip according to the fault information corresponding to each fault structure.

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 6.
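
As a non-limiting illustration of the model connection described in claim 2 and the chained sub-operation processes described in claim 3, a minimal Python sketch is given below. The class `DigitalTwin`, the functions `run_sub_process` and `simulate`, the representation of chip state as one floating-point value per component structure, and the fixed number of steps per sub-operation process are all assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical state representation: one float per component structure.
State = Dict[str, float]


@dataclass
class DigitalTwin:
    # Structure models: name -> function(own value, summed inputs) -> new value.
    models: Dict[str, Callable[[float, float], float]]
    # Connections (source, target), standing in for the connection mode
    # derived from the data interaction mode information.
    links: List[Tuple[str, str]] = field(default_factory=list)

    def step(self, state: State) -> State:
        """Advance every component by one step and return a new state."""
        inputs = {name: 0.0 for name in self.models}
        for src, dst in self.links:
            inputs[dst] += state[src]
        return {name: fn(state[name], inputs[name])
                for name, fn in self.models.items()}


def run_sub_process(twin: DigitalTwin, state: State, steps: int) -> List[State]:
    """Run one sub-operation process and return its simulation data."""
    trace = []
    for _ in range(steps):
        state = twin.step(state)
        trace.append(state)
    return trace


def simulate(twin: DigitalTwin, current_state: State,
             total_steps: int, steps_per_sub: int) -> List[State]:
    """Split the target time range into sub-operation processes, feed the
    result of each one into the next, and join the results in time order."""
    if steps_per_sub <= 0:
        raise ValueError("steps_per_sub must be positive")
    trace: List[State] = []
    state = dict(current_state)
    done = 0
    while done < total_steps:
        n = min(steps_per_sub, total_steps - done)
        sub_trace = run_sub_process(twin, state, n)
        trace.extend(sub_trace)   # connect sub-results in time order
        state = sub_trace[-1]     # last state seeds the next sub-process
        done += n
    return trace


if __name__ == "__main__":
    # Two hypothetical component structures: "core" counts up, and
    # "cache" takes the previous value of "core" through the link.
    twin = DigitalTwin(
        models={"core": lambda v, x: v + 1.0, "cache": lambda v, x: x},
        links=[("core", "cache")],
    )
    for snapshot in simulate(twin, {"core": 0.0, "cache": 0.0}, 5, 2):
        print(snapshot)
```

In this sketch, dividing the run into sub-operation processes does not change the result, because each sub-operation process starts from the final state of the previous one; the division only bounds how much simulation is computed at a time.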