CN116302899A - Core particle fault diagnosis method and device - Google Patents

Info

Publication number
CN116302899A
Authority
CN
China
Prior art keywords
core particle
fault
determining
target
time length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310557887.3A
Other languages
Chinese (zh)
Other versions
CN116302899B (en)
Inventor
王嘉诚
张少仲
张栩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongcheng Hualong Computer Technology Co Ltd
Original Assignee
Zhongcheng Hualong Computer Technology Co Ltd
Application filed by Zhongcheng Hualong Computer Technology Co Ltd
Priority to CN202310557887.3A
Publication of CN116302899A
Application granted
Publication of CN116302899B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Testing Or Measuring Of Semiconductors Or The Like (AREA)
  • Testing Of Individual Semiconductor Devices (AREA)

Abstract

The invention provides a method and a device for diagnosing core particle faults, and relates to the technical field of computer chips. The method comprises: acquiring an operation data set of a target chip comprising multiple core particles over the current total operation duration; determining at least one working mode from the operation data set; determining the operation duration and characteristic parameters of each core particle in each working mode within the current total operation duration; determining the fault probability of each core particle in the target chip from the operation duration and characteristic parameters of each core particle in each working mode; and performing fault diagnosis on each target core particle whose fault probability is greater than a preset threshold. The scheme can effectively predict and accurately diagnose faulty core particles.

Description

Core particle fault diagnosis method and device
Technical Field
The invention relates to the technical field of computer chips, in particular to a method and a device for diagnosing core particle faults.
Background
High-performance computing chips now widely adopt a core particle (chiplet) architecture for reasons of cost and scalability. Depending on the performance requirements of a given chip, a plurality of bare dies (namely core particles) are packaged together and interconnected through a high-speed bus to form chips of different performance levels. However, after a chip based on the core particle architecture is packaged, one or more core particles may fail because of faulty connections between core particles, collisions during transportation, or wear from long-term operation. Moreover, once a fault is found in a multi-core-particle chip, the packaged chip is usually discarded as a whole, because it is difficult to predict in advance which core particles are likely to fail and to accurately diagnose the faulty ones.
Disclosure of Invention
The invention provides a method and a device for diagnosing faults of core particles, which can effectively predict and accurately diagnose the fault core particles.
In a first aspect, an embodiment of the present invention provides a method for diagnosing a core particle fault, including:
acquiring an operation data set of a target chip comprising multiple core particles in the current total operation duration;
determining at least one working mode according to the operation data set;
determining the operation time length and characteristic parameters of each core particle in each working mode in the current operation total time length;
determining the fault probability of each core particle in the target chip according to the operation time length and the characteristic parameters of each core particle in each working mode;
and performing fault diagnosis on the target core particles with the fault probability larger than a preset threshold value.
Optionally, the determining at least one operation mode according to the operation data set includes:
extracting first characteristics from the operation data set to obtain operation characteristic information; wherein the operational characteristic information comprises a triggered set of hardware events for the target chip;
and classifying the operation characteristic information to obtain different working modes corresponding to different operation characteristic information.
Optionally, the determining the operation duration and the characteristic parameters of each core particle in each working mode in the current operation total duration includes:
determining the operation time length of each working mode according to the operation data set;
extracting second characteristics from the operation data set to obtain characteristic parameters under each working mode; the characteristic parameters comprise the temperature of the target chip, the power supply voltage of each core particle and the working frequency;
for each core particle, the operation duration of the core particle in each working mode is counted.
Optionally, the determining the fault probability of each core particle in the target chip includes:
for each core particle, determining a fault score of the core particle according to the characteristic parameter and the operation time length of the core particle in each working mode, and determining the fault probability of the core particle based on a preset rule and the fault score;
the fault score is determined by the following formula:
[Formula image SMS_1 is not reproduced in this text]
wherein M_i characterizes the fault score of the i-th core particle; j denotes the j-th working mode; n denotes the total number of working modes; t_{i,j} characterizes the operation duration of the i-th core particle in the j-th working mode; f_{i,j} characterizes the working frequency of the i-th core particle in the j-th working mode; V_{i,j} characterizes the power supply voltage of the i-th core particle in the j-th working mode; I_i characterizes the leakage current of the i-th core particle; P_{ei} characterizes the calibration power of the i-th core particle; T_j characterizes the temperature of the target chip in the j-th working mode; T_p characterizes the safe operating temperature of the target chip; and α and β are score coefficients.
Optionally, the preset rule is that the fault score is positively correlated with the fault probability.
Optionally, the performing fault diagnosis on the target core particle with the fault probability greater than a preset threshold includes:
for the target core particle with the fault probability larger than the preset threshold value, determining a hardware event triggered by the target core particle alone;
isolating the target core particle and running the hardware event on the target core particle;
judging whether the target core particle can run the hardware event;
if not, determining that the target core particle is faulty;
if so, acquiring the to-be-diagnosed operation data obtained by running the hardware event, and judging whether the to-be-diagnosed operation data is abnormal; and if it is abnormal, determining that the target core particle is faulty.
Optionally, before the determining the failure probability of each core particle in the target chip, the method further includes:
determining whether any core particle in the target chip has been replaced within the current total operation duration;
if so, updating the first use date of that core particle, recomputing the difference between the current date and the updated first use date to obtain the total secondary operation duration of the core particle after replacement, and determining the fault probability of the core particle within the total secondary operation duration.
In a second aspect, an embodiment of the present invention further provides a core fault diagnosis apparatus, including:
the acquisition module is used for acquiring an operation data set of a target chip comprising multiple cores in the current total operation duration;
the preprocessing module is used for determining at least one working mode according to the operation data set, and determining the operation time length and characteristic parameters of each core particle in each working mode in the current total operation time length;
the fault probability determining module is used for determining the fault probability of each core particle in the target chip according to the operation time length and the characteristic parameters of each core particle in each working mode;
and the fault diagnosis module is used for performing fault diagnosis on the target core particles with the fault probability larger than a preset threshold value.
In a third aspect, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores a computer program, and the processor implements any one of the above-mentioned methods for diagnosing a core fault when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program, which when executed in a computer, causes the computer to perform the method for diagnosing a core particle failure as set forth in any one of the above.
Embodiments of the invention provide a method and a device for diagnosing core particle faults. The method acquires an operation data set of a multi-core-particle target chip over the current total operation duration, determines the working modes contained in the data set and the operation duration and characteristic parameters of each core particle in each working mode, and from these determines the fault probability of each core particle in the target chip. When a fault probability is greater than a preset threshold, fault diagnosis is performed on the corresponding target core particle, so that the fault is accurately located. By predicting the fault probability of each core particle and promptly diagnosing only the target core particles whose fault probability exceeds the preset threshold, the method avoids real-time fault diagnosis of every core particle while still accurately locating and diagnosing faulty core particles.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for diagnosing a core fault according to an embodiment of the present invention;
FIG. 2 is a hardware architecture diagram of a computing device according to one embodiment of the present invention;
FIG. 3 is a block diagram of a core particle fault diagnosis apparatus according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a method for diagnosing a core particle fault, including:
step 100, acquiring an operation data set of a target chip comprising multiple core particles in the current total operation duration;
step 102, determining at least one working mode according to the operation data set;
step 104, determining the operation time length and characteristic parameters of each core particle in each working mode in the current operation total time length;
step 106, determining the fault probability of each core particle in the target chip according to the operation time length and the characteristic parameters of each core particle in each working mode;
and step 108, performing fault diagnosis on the target core particles with the fault probability larger than a preset threshold value.
In the embodiment of the invention, an operation data set of the multi-core-particle target chip over the current total operation duration is acquired, the working modes contained in the data set and the operation duration and characteristic parameters of each core particle in each working mode are determined, and the fault probability of each core particle in the target chip is then determined. When a fault probability is greater than a preset threshold, fault diagnosis is performed on the corresponding target core particle, so that the fault is accurately located. By predicting the fault probability of each core particle and promptly diagnosing only the target core particles whose fault probability exceeds the preset threshold, real-time fault diagnosis of every core particle becomes unnecessary and faulty core particles are still accurately located. The faulty core particle can then be repaired or replaced, which improves the utilization of both the core particles and the target chip and reduces unnecessary discarding of whole chips.
The manner in which the individual steps shown in FIG. 1 are performed is described below.
First, for step 100, the current total operation duration may be determined by, for example and without limitation, obtaining the first use date of the target chip and computing the difference between the current date and that first use date.
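The date-difference computation described for step 100 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name `total_operation_duration` and the choice of days as the unit are hypothetical.

```python
from datetime import date


def total_operation_duration(first_use: date, today: date) -> int:
    """Return the current total operation duration in days.

    Assumption: the duration is the calendar difference between the
    current date and the first use date of the target chip.
    """
    if today < first_use:
        raise ValueError("current date precedes the first use date")
    return (today - first_use).days


# Example: a chip first used on 1 January 2023, evaluated on 18 May 2023.
days = total_operation_duration(date(2023, 1, 1), date(2023, 5, 18))
print(days)
```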
In step 102, at least one operation mode is determined from the operation data set, including:
extracting first characteristics from the operation data set to obtain operation characteristic information; wherein the operational characteristic information comprises a triggered set of hardware events for the target chip;
and classifying the operation characteristic information to obtain different working modes corresponding to different operation characteristic information.
In the invention, the first feature extraction is performed on the operation data set to determine the hardware events triggered by the target chip; these hardware events are then classified to determine the different working modes of the target chip, so that the operation data set can be divided according to working mode.
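As an illustration of step 102, the sketch below groups operation records into working modes by the set of hardware events each record triggered. The source does not specify the classifier, so exact-match grouping on the event set is an assumption, as are the record layout (a dict with an `"events"` field) and the function name.

```python
from collections import defaultdict


def classify_operating_modes(records):
    """Group records into working modes by their triggered event set.

    Assumption: each record is a dict with an "events" iterable, and two
    records belong to the same working mode when their event sets match.
    Returns a dict mapping an integer mode id to the records in that mode.
    """
    mode_ids = {}               # frozenset of events -> mode id
    modes = defaultdict(list)   # mode id -> records
    for record in records:
        key = frozenset(record["events"])
        mode_id = mode_ids.setdefault(key, len(mode_ids))
        modes[mode_id].append(record)
    return dict(modes)


records = [
    {"events": ["cache_miss", "fp_op"]},
    {"events": ["fp_op", "cache_miss"]},  # same set, different order
    {"events": ["dma_transfer"]},
]
modes = classify_operating_modes(records)
```

A real system would more plausibly use a clustering or rule-based classifier over richer features; the grouping above only shows how the data set ends up partitioned by mode.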
In step 104, determining the operation duration and the characteristic parameters of each core particle in each operation mode within the current operation total duration, including:
determining the operation time length of each working mode according to the operation data set;
extracting second characteristics from the operation data set to obtain characteristic parameters under each working mode; the characteristic parameters comprise the temperature of the target chip, the power supply voltage of each core particle and the working frequency;
for each core particle, the operation duration of the core particle in each working mode is counted.
In the invention, for each working mode of the target chip within the current total operation duration, the operation duration of the target chip in that working mode is determined, and then the operation duration, power supply voltage and working frequency of each core particle, together with the temperature of the target chip, in that working mode are determined, so that the fault probability of each core particle can subsequently be scored based on these characteristic parameters. It should be noted that the core particles participating in operation may differ between working modes, so the operation duration of each core particle in each working mode must be counted separately; the operation duration of a core particle that does not participate in a given working mode is 0 for that mode.
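The per-core-particle accounting in step 104 can be sketched as below. The segment layout `(mode, hours, active_particles)` and all names are hypothetical; the only behavior taken from the text is that a core particle which does not participate in a working mode keeps a duration of 0 for that mode.

```python
def run_durations(segments, core_particles, modes):
    """Accumulate the operation duration of each core particle per mode.

    segments: iterable of (mode, hours, active_particles) tuples, where
    active_particles lists the core particles that took part in that
    segment (a hypothetical layout for the operation data set).
    """
    # Every (core particle, mode) pair starts at 0, so non-participating
    # core particles naturally end up with a duration of 0.
    durations = {cp: {m: 0.0 for m in modes} for cp in core_particles}
    for mode, hours, active in segments:
        for cp in active:
            durations[cp][mode] += hours
    return durations


segments = [
    ("A", 2.0, {"cp0", "cp1"}),
    ("B", 3.0, {"cp1"}),
    ("A", 1.0, {"cp0"}),
]
durations = run_durations(segments, ["cp0", "cp1"], ["A", "B"])
```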
For step 106, determining a failure probability of each core particle in the target chip includes:
for each core particle, determining a fault score of the core particle according to the characteristic parameter and the operation time length of the core particle in each working mode, and determining the fault probability of the core particle based on a preset rule and the fault score;
the fault score is determined by the following formula:
[Formula image SMS_2 is not reproduced in this text]
wherein M_i characterizes the fault score of the i-th core particle; j denotes the j-th working mode; n denotes the total number of working modes; t_{i,j} characterizes the operation duration of the i-th core particle in the j-th working mode; f_{i,j} characterizes the working frequency of the i-th core particle in the j-th working mode; V_{i,j} characterizes the power supply voltage of the i-th core particle in the j-th working mode; I_i characterizes the leakage current of the i-th core particle; P_{ei} characterizes the calibration power of the i-th core particle; T_j characterizes the temperature of the target chip in the j-th working mode; T_p characterizes the safe operating temperature of the target chip; and α and β are score coefficients.
The calibration power is obtained from the working power supply voltage and clock frequency given in the chip manual or chip specification. Specifically, the calibration power may be P_ei = 60% × operating power supply voltage × maximum continuous output current, or P_ei = 60% × the power of the chip or core particle measured by testing at full-load operation.
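One reading of the first calibration-power option (60% of the product of the operating supply voltage and the maximum continuous output current) is sketched below. The function name, parameter names and units (volts, amperes, watts) are assumptions made for illustration.

```python
def calibration_power(supply_voltage_v: float,
                      max_continuous_current_a: float,
                      derating: float = 0.6) -> float:
    """Return the calibration power P_ei in watts.

    Assumption: P_ei = 60% x operating supply voltage x maximum
    continuous output current, with both values taken from the chip
    manual or specification.
    """
    if supply_voltage_v < 0 or max_continuous_current_a < 0:
        raise ValueError("voltage and current must be non-negative")
    return derating * supply_voltage_v * max_continuous_current_a


# Example with hypothetical datasheet values: 1.0 V supply, 50 A maximum.
p_ei = calibration_power(1.0, 50.0)
```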
In a preferred embodiment, the preset rule is that the failure score is positively correlated with the failure probability.
According to the invention, for each core particle, the fault score of the core particle within the current total operation duration is calculated from the obtained characteristic parameters and the operation duration of the core particle in each working mode. Because the calculation takes into account the operation duration, power supply voltage, working frequency and leakage current of the core particle and the temperature of the target chip in each working mode, the influence of each factor on core particle performance is considered comprehensively, and the calculated fault score reflects the current performance state of the core particle. The higher the fault score, the higher the probability that the core particle is faulty and the worse its current performance state, so the user should pay attention to and diagnose the core particle promptly, repairing or replacing it if it has failed to keep the target chip operating effectively.
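The preset rule only requires that the fault probability increase with the fault score. The sketch below takes already-computed fault scores as input (the score formula itself is an image that is not reproduced in this text) and applies one possible monotonic mapping, 1 - exp(-score / scale). That mapping, the `scale` parameter and the function names are hypothetical choices, not the rule used in the patent.

```python
import math


def fault_probability(score: float, scale: float = 1.0) -> float:
    """Map a non-negative fault score to a probability in [0, 1).

    Assumption: any strictly increasing mapping satisfies the preset
    rule; 1 - exp(-score / scale) is used here purely for illustration.
    """
    if score < 0:
        raise ValueError("fault score must be non-negative")
    return 1.0 - math.exp(-score / scale)


def select_targets(scores: dict, threshold: float, scale: float = 1.0) -> list:
    """Return the core particles whose fault probability exceeds the threshold."""
    return [cp for cp, s in scores.items()
            if fault_probability(s, scale) > threshold]


targets = select_targets({"cp0": 0.1, "cp1": 3.0}, threshold=0.8)
```

Only the core particles returned by `select_targets` would then go through the fault diagnosis of step 108.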
For step 108, performing fault diagnosis on the target core particle with the fault probability larger than a preset threshold value includes:
for the target core particle with the fault probability larger than the preset threshold value, determining a hardware event triggered by the target core particle alone;
isolating the target core particle and running the hardware event on the target core particle;
judging whether the target core particle can run the hardware event;
if not, determining that the target core particle is faulty;
if so, acquiring the to-be-diagnosed operation data obtained by running the hardware event, and judging whether the to-be-diagnosed operation data is abnormal; and if it is abnormal, determining that the target core particle is faulty.
It should be noted that different core particles have different functions. By running a hardware event that is triggered by the target core particle alone, it is possible to determine not only whether the target core particle is faulty but also whether the main function corresponding to that hardware event is faulty.
According to the invention, fault diagnosis is performed only on target core particles whose fault probability is greater than the preset threshold, so real-time monitoring and diagnosis of all core particles in the target chip is unnecessary, yet faulty core particles can still be found promptly and located accurately, which improves the efficiency of diagnosing faulty core particles.
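The decision flow of step 108 can be sketched as follows. The two callables are hypothetical stand-ins: `run_isolated` represents isolating the target core particle and running its hardware event (returning `None` when the event cannot be run), and `is_abnormal` represents whatever abnormality check is applied to the resulting operation data.

```python
from typing import Callable, Optional, Sequence


def diagnose(run_isolated: Callable[[], Optional[Sequence[float]]],
             is_abnormal: Callable[[Sequence[float]], bool]) -> bool:
    """Return True if the target core particle is judged faulty.

    Assumption: run_isolated() returns None when the isolated core
    particle cannot run the hardware event, and otherwise returns the
    to-be-diagnosed operation data.
    """
    data = run_isolated()
    if data is None:
        # The core particle cannot run its own hardware event: faulty.
        return True
    # The event ran; the core particle is faulty only if the data is abnormal.
    return is_abnormal(data)


# Example with stub callables: the event runs but yields an out-of-range value.
faulty = diagnose(lambda: [9.0], lambda d: max(d) > 5.0)
```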
In a preferred embodiment, before determining the failure probability of each core particle in the target chip in step 106, the method further includes:
determining whether any core particle in the target chip has been replaced within the current total operation duration;
if so, updating the first use date of that core particle, recomputing the difference between the current date and the updated first use date to obtain the total secondary operation duration of the core particle after replacement, and determining the fault probability of the core particle within the total secondary operation duration.
It should be noted that a core particle may be replaced within the current total operation duration. For a replaced core particle, the operation duration should be counted from its first use date after replacement; that is, the maximum actual operation duration of the core particle is the difference between the current date and that first use date, namely the total secondary operation duration. Calculating the fault score of the core particle within the total secondary operation duration allows its fault probability to be reflected accurately.
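The replacement handling can be sketched as below: a replaced core particle is timed from its own first use date after replacement, while every other core particle is timed from the first use date of the chip. The names and the use of days as the unit are assumptions.

```python
from datetime import date
from typing import Optional


def effective_duration_days(chip_first_use: date,
                            today: date,
                            replaced_on: Optional[date] = None) -> int:
    """Return the duration over which a core particle's score is computed.

    Assumption: replaced_on is the first use date of the replacement
    core particle, or None if the core particle was never replaced.
    """
    start = replaced_on if replaced_on is not None else chip_first_use
    if today < start:
        raise ValueError("current date precedes the first use date")
    return (today - start).days


original = effective_duration_days(date(2022, 1, 1), date(2023, 1, 1))
replaced = effective_duration_days(date(2022, 1, 1), date(2023, 1, 1),
                                   replaced_on=date(2022, 12, 2))
```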
In a preferred embodiment, after fault diagnosis has been performed on the target core particle with the fault probability greater than the preset threshold and the faulty core particle has been determined, the method further includes: shielding the faulty core particle, or replacing it with a core particle of the same function at the position of the faulty core particle.
In the invention, the positioning of the fault core particles of the multi-core chip is realized, and the fault core particles can be repaired, shielded or replaced, so that the whole discarding of the chip can be avoided, and the utilization rate of the chip is improved.
As shown in FIG. 2 and FIG. 3, the embodiment of the invention provides a core particle fault diagnosis device. The device embodiments may be implemented in software, in hardware, or in a combination of hardware and software. In terms of hardware, FIG. 2 is a hardware architecture diagram of the computing device in which the core particle fault diagnosis device provided by the embodiment of the invention is located. In addition to the processor, memory, network interface and nonvolatile memory shown in FIG. 2, the computing device in which the device is located may generally include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in FIG. 3, the device in a logical sense is formed by the CPU of the computing device reading the corresponding computer program from the nonvolatile memory into the memory and running it. The present embodiment provides a core particle fault diagnosis device, including: an acquisition module 300, a preprocessing module 302, a fault probability determination module 304, and a fault diagnosis module 306;
an acquisition module 300, configured to acquire an operation data set of a target chip including multiple cores in a current total operation duration;
a preprocessing module 302, configured to determine at least one operation mode according to the operation data set, and determine an operation duration and a characteristic parameter of each core particle in each operation mode within the current total operation duration;
a fault probability determining module 304, configured to determine a fault probability of each core particle in the target chip according to an operation duration and a feature parameter of each core particle in each working mode;
and the fault diagnosis module 306 is configured to perform fault diagnosis on the target core particle with the fault probability greater than a preset threshold.
In some embodiments, the acquisition module 300 may be used to perform the above-described step 100, the preprocessing module 302 may be used to perform the above-described steps 102 and 104, the failure probability determination module 304 may be used to perform the above-described step 106, and the failure diagnosis module 306 may be used to perform the above-described step 108.
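The division of work between modules 300 to 306 can be sketched as a small pipeline. Everything here is illustrative: the class name, the callable signatures and the default threshold are hypothetical, and each module is represented by a plain function supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class CoreParticleDiagnosisDevice:
    acquire: Callable[[], Any]                               # module 300, step 100
    preprocess: Callable[[Any], Any]                         # module 302, steps 102 and 104
    fault_probabilities: Callable[[Any], Dict[str, float]]   # module 304, step 106
    diagnose: Callable[[str], bool]                          # module 306, step 108
    threshold: float = 0.8                                   # hypothetical preset threshold

    def run(self) -> List[str]:
        """Return the core particles confirmed faulty by diagnosis."""
        data = self.acquire()
        features = self.preprocess(data)
        probs = self.fault_probabilities(features)
        # Only core particles above the threshold are diagnosed.
        targets = [cp for cp, p in probs.items() if p > self.threshold]
        return [cp for cp in targets if self.diagnose(cp)]


diagnosed = []  # records which core particles were actually diagnosed
device = CoreParticleDiagnosisDevice(
    acquire=lambda: "raw operation data",
    preprocess=lambda data: data,
    fault_probabilities=lambda f: {"cp0": 0.9, "cp1": 0.2, "cp2": 0.95},
    diagnose=lambda cp: (diagnosed.append(cp), cp == "cp2")[1],
)
faulty = device.run()
```

Passing the modules in as callables keeps the sketch independent of how each step is actually implemented, which the text leaves open (software, hardware, or a combination).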
In some specific embodiments, the preprocessing module 302 is further configured to perform the following operations:
extracting first characteristics from the operation data set to obtain operation characteristic information; wherein the operational characteristic information comprises a triggered set of hardware events for the target chip;
and classifying the operation characteristic information to obtain different working modes corresponding to different operation characteristic information.
In some specific embodiments, the preprocessing module 302 is further configured to perform the following operations:
determining the operation time length of each working mode according to the operation data set;
extracting second characteristics from the operation data set to obtain characteristic parameters under each working mode; the characteristic parameters comprise the temperature of the target chip, the power supply voltage of each core particle and the working frequency;
for each core particle, the operation duration of the core particle in each working mode is counted.
In some specific embodiments, the preprocessing module 302 is further configured to perform the following operations:
determining whether any core particle in the target chip has been replaced within the current total operation duration;
if so, updating the first use date of that core particle, recomputing the difference between the current date and the updated first use date to obtain the total secondary operation duration of the core particle after replacement, and determining the fault probability of the core particle within the total secondary operation duration.
In some specific embodiments, the failure probability determination module 304 is further configured to perform the following operations:
for each core particle, determining a fault score of the core particle according to the characteristic parameter and the operation time length of the core particle in each working mode, and determining the fault probability of the core particle based on a preset rule and the fault score; the preset rule is that the fault score and the fault probability are positively correlated;
the fault score is determined by the following formula:
[Formula image SMS_3 is not reproduced in this text]
wherein M_i characterizes the fault score of the i-th core particle; j denotes the j-th working mode; n denotes the total number of working modes; t_{i,j} characterizes the operation duration of the i-th core particle in the j-th working mode; f_{i,j} characterizes the working frequency of the i-th core particle in the j-th working mode; V_{i,j} characterizes the power supply voltage of the i-th core particle in the j-th working mode; I_i characterizes the leakage current of the i-th core particle; P_{ei} characterizes the calibration power of the i-th core particle; T_j characterizes the temperature of the target chip in the j-th working mode; T_p characterizes the safe operating temperature of the target chip; and α and β are score coefficients.
In a specific embodiment, the fault diagnosis module 306 is further configured to perform the following operations:
for the target core particle with the fault probability larger than the preset threshold value, determining a hardware event triggered by the target core particle alone;
isolating the target core particle and running the hardware event on the target core particle;
judging whether the target core particle can run the hardware event;
if not, determining that the target core particle is faulty;
if so, acquiring the to-be-diagnosed operation data obtained by running the hardware event, and judging whether the to-be-diagnosed operation data is abnormal; and if it is abnormal, determining that the target core particle is faulty.
In some specific embodiments, the apparatus further comprises a repair module for performing the following: shielding the faulty core particle or replacing the core particle with the same function at the position of the faulty core particle.
It will be appreciated that the configuration illustrated in the embodiments of the present invention does not constitute a specific limitation on the core particle fault diagnosis apparatus. In other embodiments of the invention, the core particle fault diagnosis apparatus may include more or fewer components than shown, may combine or split certain components, or may use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The information exchange and execution processes between the modules in the apparatus are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.
The embodiment of the invention also provides a computing device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the core particle fault diagnosis method in any embodiment of the invention when executing the computer program.
Embodiments of the present invention also provide a computer readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to perform the core particle fault diagnosis method in any of the embodiments of the present invention.
Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of the storage medium for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it is understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into a computer or into a memory provided in an expansion module connected to the computer, and a CPU or the like mounted on the expansion board or the expansion module may then be caused to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above embodiments.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of additional identical elements in a process, method, article or apparatus that comprises the element.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media in which program code may be stored, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for diagnosing a core particle fault, comprising:
acquiring an operation data set of a target chip comprising multiple core particles within the current total operation duration;
determining at least one working mode according to the operation data set;
determining the operation time length and characteristic parameters of each core particle in each working mode within the current total operation duration;
determining the fault probability of each core particle in the target chip according to the operation time length and the characteristic parameters of each core particle in each working mode;
and performing fault diagnosis on the target core particles with the fault probability larger than a preset threshold value.
2. The method of claim 1, wherein said determining at least one operational mode from said operational data set comprises:
extracting first characteristics from the operation data set to obtain operation characteristic information; wherein the operational characteristic information comprises a triggered set of hardware events for the target chip;
and classifying the operation characteristic information to obtain different working modes corresponding to different operation characteristic information.
3. The method of claim 1, wherein said determining the operational duration and characteristic parameters of each core in each of said operational modes within said current total operational duration comprises:
determining the operation time length of each working mode according to the operation data set;
extracting second characteristics from the operation data set to obtain characteristic parameters under each working mode; the characteristic parameters comprise the temperature of the target chip, the power supply voltage of each core particle and the working frequency;
for each core particle, the operation duration of the core particle in each working mode is counted.
4. The method of claim 3, wherein the determining the fault probability of each core particle in the target chip comprises:
for each core particle, determining a fault score of the core particle according to the characteristic parameter and the operation time length of the core particle in each working mode, and determining the fault probability of the core particle based on a preset rule and the fault score;
the fault score is determined by the following formula:
Figure QLYQS_1
where M_i denotes the fault score of the i-th core particle; j denotes the j-th working mode; n denotes the total number of working modes; t_{i,j} denotes the operation duration of the i-th core particle in the j-th working mode; f_{i,j} denotes the operating frequency of the i-th core particle in the j-th working mode; V_{i,j} denotes the power supply voltage of the i-th core particle in the j-th working mode; I_i denotes the leakage current of the i-th core particle; P_{ei} denotes the calibration power of the i-th core particle; T_j denotes the temperature of the target chip in the j-th working mode; T_p denotes the safe operating temperature of the target chip; and α and β are score coefficients.
5. The method of claim 4, wherein the predetermined rule is that the fault score is positively correlated with the fault probability.
6. The method according to any one of claims 1 to 5, wherein the performing fault diagnosis on the target core particle with the probability of fault greater than a preset threshold value comprises:
for a target core particle whose fault probability is greater than a preset threshold, determining a hardware event that is triggered by the target core particle alone;
isolating the target core particle and running the hardware event on the target core particle;
judging whether the target core particle can run the hardware event;
if the judging result is negative, determining that the target core particle is faulty;
if the judging result is yes, acquiring the operation data to be diagnosed that is obtained by running the hardware event, and judging whether the operation data to be diagnosed is abnormal; and if it is abnormal, determining that the target core particle is faulty.
7. The method according to any one of claims 1 to 5, further comprising, prior to said determining the probability of failure of each core particle in the target chip:
judging whether each core particle in the target chip has been replaced within the current total operation duration;
if yes, updating the first use date of the core particle, and recomputing the difference between the current date and the updated first use date to obtain the total secondary operation duration of the core particle after replacement, so as to determine the fault probability of the core particle within the total secondary operation duration.
8. A core particle failure diagnosis apparatus, comprising:
the acquisition module is used for acquiring an operation data set of a target chip comprising multiple core particles within the current total operation duration;
the preprocessing module is used for determining at least one working mode according to the operation data set, and determining the operation time length and characteristic parameters of each core particle in each working mode in the current total operation time length;
the fault probability determining module is used for determining the fault probability of each core particle in the target chip according to the operation time length and the characteristic parameters of each core particle in each working mode;
and the fault diagnosis module is used for performing fault diagnosis on the target core particles with the fault probability larger than a preset threshold value.
9. A computing device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the method of any of claims 1-7 when the computer program is executed.
10. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-7.
CN202310557887.3A 2023-05-18 2023-05-18 Core particle fault diagnosis method and device Active CN116302899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310557887.3A CN116302899B (en) 2023-05-18 2023-05-18 Core particle fault diagnosis method and device

Publications (2)

Publication Number Publication Date
CN116302899A true CN116302899A (en) 2023-06-23
CN116302899B CN116302899B (en) 2023-07-28

Family

ID=86790978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310557887.3A Active CN116302899B (en) 2023-05-18 2023-05-18 Core particle fault diagnosis method and device

Country Status (1)

Country Link
CN (1) CN116302899B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030060985A1 (en) * 2001-09-25 2003-03-27 Mitsubishi Denki Kabushiki Kaisha Computer-implemented method of defect analysis
CN108241114A (en) * 2016-12-26 2018-07-03 北京邮电大学 Online fault detection method, device and the fault detection circuit of hybrid circuit chip
WO2022068645A1 (en) * 2020-09-30 2022-04-07 深圳前海微众银行股份有限公司 Database fault discovery method, apparatus, electronic device, and storage medium
CN114300031A (en) * 2020-10-07 2022-04-08 美光科技公司 On-die testing of memory devices
CN115576738A (en) * 2022-12-08 2023-01-06 中诚华隆计算机技术有限公司 Method and system for realizing equipment fault determination based on chip analysis
CN115617739A (en) * 2022-09-27 2023-01-17 南京信息工程大学 Chip based on Chiplet architecture and control method
CN115668163A (en) * 2020-05-19 2023-01-31 美光科技公司 Opcode storage for on-die microprocessors
CN115658400A (en) * 2022-09-30 2023-01-31 贵州电网有限责任公司 Method and system for testing computing chip, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117148117A (en) * 2023-10-27 2023-12-01 中诚华隆计算机技术有限公司 Chiplet fault automatic detection and repair method and system
CN117148117B (en) * 2023-10-27 2023-12-29 中诚华隆计算机技术有限公司 Chiplet fault automatic detection and repair method and system

Also Published As

Publication number Publication date
CN116302899B (en) 2023-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant