CN116795648A - Method and device for detecting server, storage medium and electronic device - Google Patents

Publication number
CN116795648A
Authority
CN
China
Prior art keywords
cpu
target
pressure test
ccix
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310778073.2A
Other languages
Chinese (zh)
Inventor
张增建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310778073.2A
Publication of CN116795648A

Abstract

Embodiments of the application provide a method and a device for detecting a server, a storage medium and an electronic device. The method comprises: when a target parameter in a target register of the server is set to a target value, performing a stress test on a group of CPUs in the server by increasing the CPU load step by step, wherein the target value makes the number of CCIX-protocol interconnection operations between different CPUs in the group exceed a preset default value during the stress test; acquiring a target log generated by the stress test, wherein the target log records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test; and determining, according to the target log, whether a CCIX interconnection fault has occurred between different CPUs in the group. The embodiments address the technical problem of low server detection efficiency in the related art.

Description

Method and device for detecting server, storage medium and electronic device
Technical Field
The embodiments of the application relate to the technical field of servers, and in particular to a method and a device for detecting a server, a storage medium and an electronic device.
Background
Machine learning and big data are profoundly changing how data is processed. Custom off-chip accelerators extend conventional processors in applications ranging from computing to networking, which has driven the industry as a whole towards accelerators and heterogeneous computing. For many of today's computing tasks, an accelerator can perform the desired function faster and with lower power consumption than a processor alone. Uncontrolled heterogeneity, however, leads to software complexity. Cache Coherent Interconnect for Accelerators (CCIX) is a chip-to-chip interconnect that enables two or more devices to share data in a cache-coherent manner. CCIX aims to optimize and simplify the architecture of heterogeneous systems built from processors of different instruction set architectures (ISAs) or application-specific accelerators, while increasing system bandwidth and reducing latency. For example, in a server system with multiple CPUs, the CPUs are interconnected through CCIX, and as CPU temperature and load change, the CPUs perform operations such as handshaking, synchronization and parameter retraining to stay consistent with one another. If a CPU CCIX problem or fault occurs, server performance degrades, the system behaves abnormally, or the system may even go down.
In the related art, CCIX in a multi-socket server system is usually tested with external equipment: for example, an instrument such as a protocol analyzer is connected to pins on the motherboard by soldered fly wires, characteristics such as signal, voltage and current are measured, and whether the CPU CCIX is normal is judged from those characteristics. This approach requires soldering on the motherboard, which easily damages the original pins, so that the motherboard or the CPU can no longer be assembled and shipped; nor is it feasible in a server assembly factory to analyze every server one by one with external equipment such as a protocol analyzer. In addition, the related art checks the CPUs of a server as a whole for CCIX problems and, when a problem is found, returns all of the CPUs to the original manufacturer together, so no individual CPU is accurately detected or diagnosed. The detection efficiency for servers in the related art is therefore low.
No effective solution to this problem of low server detection efficiency has yet been proposed.
Disclosure of Invention
The embodiments of the application provide a method and a device for detecting a server, a storage medium and an electronic device, so as to at least solve the technical problem of low server detection efficiency in the related art.
According to an embodiment of the present application, a method for detecting a server is provided, comprising: when a target parameter in a target register of the server is set to a target value, performing a stress test on a group of CPUs in the server by increasing the CPU load step by step, wherein the target value makes the number of interconnection operations between different CPUs in the group, based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol, exceed a preset default value during the stress test; acquiring a target log generated by the stress test, wherein the target log records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test; and determining, according to the target log, whether the CCIX interconnection fault has occurred between different CPUs in the group.
In one exemplary embodiment, performing the stress test on a group of CPUs in the server by increasing the CPU load step by step comprises: with the load of a second CPU held at zero, performing the stress test on a first CPU and the second CPU by increasing the load of the first CPU step by step until it reaches a preset first maximum value, wherein the group of CPUs includes the first CPU and the second CPU, which are interconnected based on the CCIX protocol; and/or, with the load of the first CPU held at the preset first maximum value, performing the stress test on the first CPU and the second CPU by increasing the load of the second CPU step by step until it reaches a preset second maximum value.
In an exemplary embodiment, determining according to the target log whether the CCIX interconnection fault has occurred between different CPUs in the group comprises at least one of: determining that the CCIX interconnection fault has occurred between at least some of the CPUs in the group when the target log includes a target field, wherein the target field indicates that a hardware fault corresponding to the CCIX interconnection fault has occurred in the server; and determining that the CCIX interconnection fault has occurred between at least some of the CPUs in the group when the target log includes target alarm information, wherein the target alarm information indicates a CPU alarm corresponding to the CCIX interconnection fault.
In one exemplary embodiment, performing the stress test on a group of CPUs in the server by increasing the CPU load step by step comprises: performing a first stress test on a first CPU and a second CPU with the load of the first CPU set to a first preset proportion of a preset first maximum value and the load of the second CPU set to zero, wherein the group of CPUs includes the first CPU and the second CPU, which are interconnected based on the CCIX protocol; performing a second stress test on the first CPU and the second CPU with the load of the first CPU set to a second preset proportion of the first maximum value and the load of the second CPU set to zero; performing a third stress test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a third preset proportion of a preset second maximum value; and performing a fourth stress test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a fourth preset proportion of the second maximum value; wherein the second preset proportion is greater than the first preset proportion, the fourth preset proportion is greater than the third preset proportion, and the stress test comprises the first, second, third and fourth stress tests.
In an exemplary embodiment, the method further comprises: when it is determined that the CCIX interconnection fault has occurred between the first CPU and the second CPU, determining, according to the target log, that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
In an exemplary embodiment, determining according to the target log that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault comprises: acquiring a target time recorded in the target log, wherein the target time is the time, recorded in the target log, at which the CCIX interconnection fault occurred; determining that the first CPU has a fault associated with the CCIX interconnection fault when the target time lies in a first time interval or a second time interval, wherein the first time interval is the interval in which the first stress test is performed, the second time interval is the interval in which the second stress test is performed, and the maximum value of the first time interval is less than or equal to the minimum value of the second time interval; and determining that the second CPU has a fault associated with the CCIX interconnection fault when the target time lies in a third time interval or a fourth time interval, wherein the third time interval is the interval in which the third stress test is performed, the fourth time interval is the interval in which the fourth stress test is performed, and the maximum value of the third time interval is less than or equal to the minimum value of the fourth time interval.
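The interval rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `attribute_fault`, the representation of the four test intervals as five ascending boundary timestamps, and the half-open treatment of shared boundaries are all hypothetical and are not specified by the application.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def attribute_fault(fault_time: float, boundaries: Sequence[float]) -> Optional[str]:
    """Map a fault time to the CPU deemed faulty.

    boundaries holds five ascending timestamps t0..t4 (an assumed layout);
    test k is taken to run in the half-open interval [t(k-1), t(k)).
    """
    if len(boundaries) != 5 or list(boundaries) != sorted(boundaries):
        raise ValueError("expected five ascending boundaries t0..t4")
    test_index = bisect_right(boundaries, fault_time)  # 1..4 inside the run
    if test_index in (0, 5):
        return None  # the fault time lies outside all four tests
    # Tests 1-2 load only the first CPU; tests 3-4 ramp the second CPU.
    return "first CPU" if test_index <= 2 else "second CPU"
```

With boundaries `[0, 10, 20, 30, 40]`, a fault at time 15 falls in the second test and is attributed to the first CPU, while a fault at time 20 falls in the third test and is attributed to the second CPU.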
In an exemplary embodiment, the method further comprises: when it is determined that the CCIX interconnection fault has occurred between the first CPU and the second CPU, determining fault description information according to the target log, wherein the fault description information describes the fault, associated with the CCIX interconnection fault, of at least one of the first CPU and the second CPU.
In an exemplary embodiment, determining the fault description information according to the target log comprises: acquiring a target time recorded in the target log, wherein the target time is the time, recorded in the target log, at which the CCIX interconnection fault occurred; and determining the fault description information according to the time interval in which the target time lies, as follows.
When the target time lies in a first time interval, the fault description information is determined, according to a first test result, to be first fault description information. The first time interval is the interval in which the first stress test is performed. The first fault description information indicates that the first CPU failed at a first target temperature with the load of the first CPU at the first preset proportion of the first maximum value. The first target temperature is the temperature of the first CPU at the target time during the first stress test. The first test result is obtained by performing the first stress test and includes the target time and the first target temperature in correspondence.
When the target time lies in a second time interval, the fault description information is determined, according to a second test result, to be second fault description information. The second time interval is the interval in which the second stress test is performed. The second fault description information indicates that the first CPU failed at a second target temperature with the load of the first CPU at the second preset proportion of the first maximum value. The second target temperature is the temperature of the first CPU at the target time during the second stress test. The second test result is obtained by performing the second stress test and includes the target time and the second target temperature in correspondence. The maximum value of the first time interval is less than or equal to the minimum value of the second time interval.
When the target time lies in a third time interval, the fault description information is determined, according to a third test result, to be third fault description information. The third time interval is the interval in which the third stress test is performed. The third fault description information indicates that the second CPU failed at a third target temperature with the load of the second CPU at the third preset proportion of the second maximum value. The third target temperature is the temperature of the second CPU at the target time during the third stress test. The third test result is obtained by performing the third stress test and includes the target time and the third target temperature in correspondence.
When the target time lies in a fourth time interval, the fault description information is determined, according to a fourth test result, to be fourth fault description information. The fourth time interval is the interval in which the fourth stress test is performed. The fourth fault description information indicates that the second CPU failed at a fourth target temperature with the load of the second CPU at the fourth preset proportion of the second maximum value. The fourth target temperature is the temperature of the second CPU at the target time during the fourth stress test. The fourth test result is obtained by performing the fourth stress test and includes the target time and the fourth target temperature in correspondence. The maximum value of the third time interval is less than or equal to the minimum value of the fourth time interval.
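As a rough illustration of how a test result that pairs times with temperatures could yield the fault description, consider the Python sketch below. The names `temperature_at` and `describe_fault`, the list-of-samples representation, the choice of the latest sample at or before the target time, and the wording of the description string are all assumptions made for this example.

```python
from bisect import bisect_right
from typing import List, Tuple

Sample = Tuple[float, float]  # (time, CPU temperature in degrees Celsius)


def temperature_at(samples: List[Sample], target_time: float) -> float:
    """Return the temperature of the latest sample at or before target_time.

    samples must be sorted by time; using the latest earlier sample is an
    assumption, since the text only says time and temperature correspond.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, target_time)
    if i == 0:
        raise ValueError("no sample at or before the target time")
    return samples[i - 1][1]


def describe_fault(cpu: str, load_fraction: float,
                   samples: List[Sample], target_time: float) -> str:
    """Build a description: which CPU failed, at what temperature and load."""
    temp = temperature_at(samples, target_time)
    return (f"{cpu} failed at {temp:.1f} degC "
            f"with load at {load_fraction:.0%} of its maximum")
```

For instance, given samples `[(0, 40.0), (10, 55.5), (20, 61.0)]` recorded during the third test, a fault at time 12 on the second CPU at 50% load would be described using the 55.5 degC sample.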
In an exemplary embodiment, before performing the stress test on the group of CPUs in the server by increasing the CPU load step by step, the method further comprises: setting the target parameter in the target register to the target value in response to an acquired setting instruction.
According to another embodiment of the present application, a device for detecting a server is also provided, comprising: an execution module, configured to perform a stress test on a group of CPUs in the server by increasing the CPU load step by step when a target parameter in a target register of the server is set to a target value, wherein the target value makes the number of interconnection operations between different CPUs in the group, based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol, exceed a preset default value during the stress test; an acquisition module, configured to acquire a target log generated by the stress test, wherein the target log records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test; and a first determining module, configured to determine, according to the target log, whether the CCIX interconnection fault has occurred between different CPUs in the group.
According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the embodiments of the application, when the target parameter of the target register in the server is set to the target value, a stress test is performed on a group of CPUs in the server by increasing the CPU load step by step, the target value making the number of CCIX-protocol interconnection operations between different CPUs in the group exceed the preset default value during the stress test. A target log generated by the stress test is then acquired; it records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test, so whether a CCIX interconnection fault has occurred between different CPUs in the group can be determined from the target log. This avoids the low detection efficiency caused in the related art by analyzing servers one by one with external equipment such as a protocol analyzer, and also avoids the inability of the related art, which tests the CPUs of a server only as a whole, to accurately detect or diagnose whether each individual CPU is faulty. The technical problem of low server detection efficiency in the related art is thereby solved, and server detection efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of the hardware environment of a server on which a method for detecting a server according to an embodiment of the present application runs;
FIG. 2 is a flow chart of a method of detecting a server according to an embodiment of the present application;
FIG. 3 is a flow chart of anomaly detection for a two-way CPU server according to an embodiment of the present application;
FIG. 4 is a graph of CPU usage versus temperature change according to an embodiment of the present application;
FIG. 5 is a structural diagram of a device for detecting a server according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided in the present application may be executed on a server, a computer terminal, a device terminal, or a similar computing apparatus. Taking execution on a server as an example, FIG. 1 is a schematic diagram of the hardware environment of a server on which a method for detecting a server according to an embodiment of the present application runs. As shown in FIG. 1, the server may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data. In one exemplary embodiment, the server may further include a transmission device 106 for communication functions and an input-output device 108. Those skilled in the art will appreciate that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the server. For example, the server may include more or fewer components than shown in FIG. 1, or have a different configuration with functions equivalent to or greater than those shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a detection method of a server in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to a server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the server. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station in order to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module used to communicate with the internet wirelessly.
This embodiment provides a method for detecting a server. FIG. 2 is a flowchart of the method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
Step S202, when a target parameter in a target register of the server is set to a target value, performing a stress test on a group of CPUs in the server by increasing the CPU load step by step, wherein the target value makes the number of interconnection operations between different CPUs in the group, based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol, exceed a preset default value during the stress test;
Step S204, acquiring a target log generated by the stress test, wherein the target log records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test;
Step S206, determining, according to the target log, whether the CCIX interconnection fault has occurred between different CPUs in the group.
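The order of steps S202 to S206 can be summarized in a short Python sketch. Everything here is hypothetical scaffolding: `detect_server` and its four callable parameters are invented for illustration, and how the register is written, how load is generated and where the log is read from are deliberately left to the caller because the application does not tie them to a specific interface.

```python
from typing import Callable


def detect_server(
    set_target_value: Callable[[], None],
    run_stress_test: Callable[[], None],
    read_target_log: Callable[[], str],
    log_shows_ccix_fault: Callable[[str], bool],
) -> bool:
    """Return True if the target log indicates a CCIX interconnection fault."""
    set_target_value()                # precondition of S202: register holds the target value
    run_stress_test()                 # S202: step-by-step CPU load increase
    log = read_target_log()           # S204: acquire the target log
    return log_shows_ccix_fault(log)  # S206: decide from the log
```

Passing the steps in as callables keeps the sketch independent of any particular platform tool, so each step can be replaced by a stub when trying out the flow.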
Through the above steps, when the target parameter of the target register in the server is set to the target value, a stress test is performed on a group of CPUs in the server by increasing the CPU load step by step, the target value making the number of CCIX-protocol interconnection operations between different CPUs in the group exceed the preset default value during the stress test. A target log generated by the stress test is then acquired; it records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test, so whether a CCIX interconnection fault has occurred between different CPUs in the group can be determined from the target log. This avoids the low detection efficiency caused in the related art by analyzing servers one by one with external equipment such as a protocol analyzer, and also avoids the inability of the related art, which tests the CPUs of a server only as a whole, to accurately detect or diagnose whether each individual CPU is faulty. The technical problem of low server detection efficiency in the related art is thereby solved, and server detection efficiency is improved.
The above steps may be executed by a device-side entity such as a detection device, a detection tool, or test software on a terminal, by a processor with human-computer interaction capability configured on a storage device, or by a processing device or processing unit with similar processing capability, but are not limited thereto.
In the technical solution of step S202, a target parameter of a target register in the server is set. For example, the target register may be the CCIX retrain setting register of the CPU. In the related art, the CCIX retrain count in the CPU defaults to only 3, and once a retrain operation succeeds, no further retrain is performed. In the embodiment of the present application, the target value of the CCIX retrain count may be set to 150 (or another number). That is, changing the register changes the number of retrain operations to the target value, and the number of retrain operations given by the target value (for example, 150) is still completed even after a retrain succeeds, which increases the number of CCIX interconnection operations. By setting the CPU register so that CCIX retrain is triggered at high frequency, while performing the stress test on the different CPUs in the group by increasing the CPU load step by step, CPU CCIX problems can be exposed quickly.
In the technical solution of step S204, a target log generated by the stress test is acquired. The target log records log information corresponding to a CCIX interconnection fault when such a fault occurs during the stress test; that is, when a CCIX interconnection fault occurs between different CPUs, the target log records the relevant fault information.
In the technical solution of step S206, whether a CCIX interconnection fault has occurred between different CPUs in the group may be determined according to the target log. For example, the dmesg and messages logs under the OS are checked for the "hard Error" keyword; if the keyword appears in a log, a CCIX problem has been triggered, the CPU is likely faulty, and the bad CPU is intercepted. Alternatively, the alarm log in the BMC SEL is checked, for example with ipmitool sel list, to see whether any alarm is present; if there is an alarm pointing to a CPU, an error is reported and the unit is intercepted.
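A minimal Python sketch of the keyword check on OS log text is shown below. The function name `find_fault_lines` is hypothetical. The pattern is also an assumption: it matches the "hard Error" keyword named in the text case-insensitively and additionally accepts the spelling "Hardware Error", on the guess that real kernel messages may use that form.

```python
import re

# Assumed keyword pattern: "hard Error" as written in the text, plus the
# "Hardware Error" spelling, matched case-insensitively.
FAULT_PATTERN = re.compile(r"hard(?:ware)?\s+error", re.IGNORECASE)


def find_fault_lines(log_text: str) -> list:
    """Return the lines of a dmesg or messages dump that contain the keyword."""
    return [line for line in log_text.splitlines() if FAULT_PATTERN.search(line)]


def has_ccix_fault_keyword(log_text: str) -> bool:
    """True if at least one line carries the fault keyword."""
    return bool(find_fault_lines(log_text))
```

The log text itself could be captured beforehand, for example by saving the output of `dmesg` to a file; the sketch only inspects text it is given.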
This embodiment avoids the low detection efficiency caused in the related art by analyzing servers one by one with external equipment such as a protocol analyzer, and also avoids the inability of the related art, which tests the CPUs of a server only as a whole, to accurately detect or diagnose whether each individual CPU is faulty. The technical problem of low server detection efficiency in the related art is thereby solved, and server detection efficiency is improved.
In an alternative embodiment, performing the stress test on a group of CPUs in the server by increasing the CPU load step by step comprises: with the load of a second CPU held at zero, performing the stress test on a first CPU and the second CPU by increasing the load of the first CPU step by step until it reaches a preset first maximum value, wherein the group of CPUs includes the first CPU and the second CPU, which are interconnected based on the CCIX protocol; and/or, with the load of the first CPU held at the preset first maximum value, performing the stress test on the first CPU and the second CPU by increasing the load of the second CPU step by step until it reaches a preset second maximum value.
In the above embodiment, a group of CPUs including a first CPU and a second CPU is taken as an example, where the first CPU may comprise one or more CPUs and the second CPU may likewise comprise one or more CPUs. The stress test is performed on the group by increasing the load of the first CPU step by step until it reaches the preset first maximum value. For example, the preset first maximum value is 100%, that is, a utilization of 100% for the first CPU: if the first CPU has 4 cores (or 8, 16 or another number), the preset first maximum value corresponds to all 4 cores running at full load. The preset first maximum value need not be 100%. In practice, the stepping can be chosen as needed: for example, the load of the first CPU may be raised in four steps of 25%, 50%, 75% and 100%, or in two steps of 50% and 100%. And/or, with the load of the first CPU held at the preset first maximum value (for example, a first-CPU utilization of 100%), the stress test is performed on the group by increasing the load of the second CPU step by step until it reaches the preset second maximum value; as with the first CPU, the load of the second CPU may be raised in four steps of 25%, 50%, 75% and 100%, or in two steps of 50% and 100%. In this embodiment, increasing the CPU load step by step exposes CPU CCIX problems or faults in the server, so that bad CPUs with CCIX problems can be screened out effectively.
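The stepping described above can be written down as a small schedule generator. The following Python sketch is illustrative only: the function name `load_schedule` and the representation of each stage as a `(first CPU load, second CPU load)` pair in percent are assumptions, and the code only produces the plan; it does not generate any CPU load itself.

```python
from typing import List, Sequence, Tuple


def _check_steps(steps: Sequence[int]) -> None:
    """Steps must be strictly increasing percentages within (0, 100]."""
    if not steps or any(not 0 < s <= 100 for s in steps):
        raise ValueError("steps must be non-empty percentages in (0, 100]")
    if any(a >= b for a, b in zip(steps, steps[1:])):
        raise ValueError("steps must be strictly increasing")


def load_schedule(cpu1_steps: Sequence[int],
                  cpu2_steps: Sequence[int]) -> List[Tuple[int, int]]:
    """Ramp the first CPU with the second idle, then ramp the second CPU
    while the first stays at its final (maximum) step."""
    _check_steps(cpu1_steps)
    _check_steps(cpu2_steps)
    plan = [(s, 0) for s in cpu1_steps]
    plan += [(cpu1_steps[-1], s) for s in cpu2_steps]
    return plan
```

Calling `load_schedule([25, 50, 75, 100], [50, 100])` gives the four-step ramp for the first CPU followed by a two-step ramp for the second CPU, with the first CPU held at 100% throughout the second ramp.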
In an alternative embodiment, said determining, based on said target log, whether said CCIX interconnection fault occurs between different CPUs in said set of CPUs includes at least one of: determining that the CCIX interconnection fault occurs between at least some CPUs in the set of CPUs in the case that a target field is included in the target log, wherein the target field is used for indicating that a hardware fault corresponding to the CCIX interconnection fault occurs in the server; and under the condition that the target log comprises target alarm information, determining that the CCIX interconnection fault occurs among at least part of the CPUs in the group of CPUs, wherein the target alarm information is used for indicating CPU alarms corresponding to the CCIX interconnection fault.
In the above embodiment, when the target log includes the target field, it is determined that a CCIX interconnection fault has occurred between at least some CPUs in the group of CPUs. For example, the target field is the keyword "Hardware Error": the dmesg and messages logs under the OS are checked for this keyword, and when it appears in a log, this indicates that a CCIX fault has been triggered and that a CPU is likely faulty. Optionally, it may also be determined that a CCIX interconnection fault exists between at least some CPUs in the group when the target log includes the target alarm information, for example by checking the BMC SEL (system event log) with ipmitool sel list for alarm entries. This embodiment thus determines, from the target log, whether a CCIX interconnection fault occurs between different CPUs in the group of CPUs.
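A minimal sketch of the two log checks is given below. It assumes the logs have already been read into strings (for example from dmesg output and from ipmitool sel list output); the helper names and the "CPU" marker used to recognise CPU-related SEL entries are hypothetical, since the description does not fix the exact log formats.

```python
def has_hardware_error(log_text, keyword="Hardware Error"):
    """Return True if any line of the OS log contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return any(needle in line.lower() for line in log_text.splitlines())


def cpu_alarms(sel_lines, marker="CPU"):
    """Return the SEL entries that mention the marker.

    Assumption: CPU-related alarms can be recognised by a substring match;
    real SEL entries may need a more specific filter.
    """
    return [line for line in sel_lines if marker in line]


def ccix_fault_detected(os_log_text, sel_lines):
    """A fault is flagged if either check finds something."""
    return has_hardware_error(os_log_text) or bool(cpu_alarms(sel_lines))
```

Either check alone is sufficient to flag the group of CPUs, which mirrors the "at least one of" wording of the embodiment.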
In an alternative embodiment, the performing the stress test on a set of CPUs in the server in a stepwise increasing CPU load includes: performing a first pressure test on a first CPU and a second CPU with a load of the first CPU set to a first preset proportion of a preset first maximum value and a load of the second CPU set to zero, wherein the set of CPUs includes the first CPU and the second CPU, and the first CPU and the second CPU are interconnected based on the CCIX protocol; performing a second pressure test on the first CPU and the second CPU with the load of the first CPU set to a second preset proportion of the first maximum value and the load of the second CPU set to zero; performing a third pressure test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a third preset proportion of a preset second maximum value; performing a fourth pressure test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a fourth preset proportion of the second maximum value; the second preset proportion is larger than the first preset proportion, the fourth preset proportion is larger than the third preset proportion, and the pressure test comprises the first pressure test, the second pressure test, the third pressure test and the fourth pressure test.
In the above embodiment, take the preset first maximum value to be a first-CPU utilization of 100% (other values are of course possible). If the first preset proportion is 50% (or 60%, or another value), the first pressure test is performed on the first CPU and the second CPU in the group of CPUs with the load of the first CPU at 50% and the load of the second CPU at zero; this can be understood as the first-stage pressure test on the group of CPUs. If the second preset proportion is 100% (or 90%, or another value), the second pressure test is performed with the load of the first CPU at 100% and the load of the second CPU at zero; this can be understood as the second-stage pressure test. Taking the preset second maximum value to be 100% as well (other values are also possible), if the third preset proportion is 50% (or another value), the third pressure test is performed with the load of the first CPU at 100% and the load of the second CPU at 50%; this can be understood as the third-stage pressure test. If the fourth preset proportion is 100% (or 90%, or another value), the fourth pressure test is performed with the load of the first CPU at 100% and the load of the second CPU at 100%; this can be understood as the fourth-stage pressure test. In this embodiment, the pressure test is performed on the group of CPUs in four stages, with the CPU load increasing step by step from the first stage to the fourth. In practical applications, the load of the first CPU or of the second CPU may be divided into more stages, for example four stages of 25%, 50%, 75% and 100% for each CPU, in which case the test time increases correspondingly. In addition, the test duration of each stage may be set as required, for example 15 minutes (or 10 minutes, or another duration) for each of the four stages, or different durations for different stages. Increasing the CPU load (or CPU utilization) step by step makes it possible to detect faults that a CPU may exhibit under different conditions.
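The four-stage plan can be written down as a small table of per-CPU loads. The sketch below is illustrative only; the Stage type and build_stages function are hypothetical names, and the default proportions are the example values from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_cpu_load: int   # percent of the preset first maximum value
    second_cpu_load: int  # percent of the preset second maximum value
    minutes: int


def build_stages(p1=50, p2=100, p3=50, p4=100, minutes=15):
    """Build the four pressure-test stages from the four preset proportions."""
    if not (p2 > p1 and p4 > p3):
        raise ValueError("require p2 > p1 and p4 > p3")
    return [
        Stage("first", p1, 0, minutes),
        Stage("second", p2, 0, minutes),
        Stage("third", p2, p3, minutes),
        Stage("fourth", p2, p4, minutes),
    ]
```

With the defaults, the first CPU is ramped to its maximum before any load is placed on the second CPU, and the whole test takes four times the per-stage duration.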
In an alternative embodiment, the method further comprises: and under the condition that the CCIX interconnection fault occurs between the first CPU and the second CPU, determining that at least one of the first CPU and the second CPU has the fault associated with the CCIX interconnection fault according to the target log.
In the above embodiment, when it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU, it may further be determined from the target log that at least one of the first CPU and the second CPU is a problem CPU associated with the CCIX interconnection fault. Optionally, the target log may include an identifier of the CPU for which the "Hardware Error" keyword was reported; for example, the identifier may indicate the first CPU or the second CPU.
In an alternative embodiment, said determining, from said target log, that at least one of said first CPU and said second CPU is experiencing a failure associated with said CCIX interconnection failure comprises: acquiring a target time recorded in the target log, wherein the target time is the time, recorded in the target log, at which the CCIX interconnection fault occurred; determining that the first CPU has a fault associated with the CCIX interconnection fault if the target time is located in a first time interval or a second time interval, wherein the first time interval is the time interval in which the first pressure test is performed, the second time interval is the time interval in which the second pressure test is performed, and the maximum value of the first time interval is less than or equal to the minimum value of the second time interval; and determining that the second CPU has a fault associated with the CCIX interconnection fault if the target time is located in a third time interval or a fourth time interval, wherein the third time interval is the time interval in which the third pressure test is performed, the fourth time interval is the time interval in which the fourth pressure test is performed, and the maximum value of the third time interval is less than or equal to the minimum value of the fourth time interval.
In the above embodiment, the target time at which the CCIX interconnection fault occurred can be obtained from the target log; that is, the target log also records the time at which the CCIX fault occurred during server detection. The CPU responsible for the CCIX interconnection fault is then further determined from the relationship between the target time and the first, second, third and fourth time intervals. For example, when the target time is located in the first time interval or the second time interval, which are the time intervals in which the first and second pressure tests are performed, it may be determined that the first CPU has a problem associated with the CCIX interconnection fault: because the load of the second CPU is zero in these two intervals, the first CPU is the more likely source of the problem. As an alternative embodiment, when the target time is located in the third time interval or the fourth time interval, which are the time intervals in which the third and fourth pressure tests are performed, it may be determined that the second CPU is the more likely source of the problem associated with the CCIX interconnection fault.
In this embodiment, the time of the CCIX interconnection fault recorded in the target log is combined with the load conditions of the different pressure-test stages to further identify the problem CPU associated with the CCIX interconnection fault. This avoids the situation in the related art where the multi-way CPUs of a server are only tested as a whole and, when a problem is found, are all returned to the factory together because the individual CPUs cannot be accurately tested or diagnosed.
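The mapping from fault time to suspect CPU can be sketched as below. This is a hedged illustration: locate_faulty_cpu is a hypothetical name, times are plain numbers (for example seconds since the start of the test), and treating each interval as half-open so that a boundary time belongs to the later stage is an assumption, not something the description prescribes.

```python
def locate_faulty_cpu(target_time, intervals):
    """Map the fault time to the CPU most likely at fault.

    intervals: four (start, end) pairs, in stage order (first to fourth test).
    Returns "first CPU", "second CPU", or None if the time is outside all stages.
    """
    for index, (start, end) in enumerate(intervals):
        # Assumption: half-open intervals, so a boundary time counts
        # towards the stage that is just starting.
        if start <= target_time < end:
            return "first CPU" if index < 2 else "second CPU"
    return None
```

With four 15-minute stages expressed in seconds, a fault logged at 1000 s falls in the second stage and points at the first CPU, while one logged at 2000 s falls in the third stage and points at the second CPU.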
In an alternative embodiment, the method further comprises: and under the condition that the CCIX interconnection fault occurs between the first CPU and the second CPU, determining fault description information according to the target log, wherein the fault description information is used for describing that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
In the above embodiment, when it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU, fault description information may further be determined according to the target log, where the fault description information describes that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
In an optional embodiment, the determining fault description information according to the target log includes: acquiring a target time recorded in the target log, wherein the target time is the time, recorded in the target log, at which the CCIX interconnection fault occurred; when the target time is in a first time interval, determining the fault description information as first fault description information according to a first test result, wherein the first time interval is the time interval in which the first pressure test is performed, the first fault description information indicates that the first CPU failed at a first target temperature with the load of the first CPU at the first preset proportion of the first maximum value, the first target temperature is the temperature of the first CPU at the target time during the first pressure test, the first test result is the test result obtained by performing the first pressure test, and the first test result includes the target time and the first target temperature in correspondence with each other; when the target time is in a second time interval, determining the fault description information as second fault description information according to a second test result, wherein the second time interval is the time interval in which the second pressure test is performed, the second fault description information indicates that the first CPU failed at a second target temperature with the load of the first CPU at the second preset proportion of the first maximum value, the second target temperature is the temperature of the first CPU at the target time during the second pressure test, the second test result is the test result obtained by performing the second pressure test, the second test result includes the target time and the second target temperature in correspondence with each other, and the maximum value of the first time interval is less than or equal to the minimum value of the second time interval; when the target time is in a third time interval, determining the fault description information as third fault description information according to a third test result, wherein the third time interval is the time interval in which the third pressure test is performed, the third fault description information indicates that the second CPU failed at a third target temperature with the load of the second CPU at the third preset proportion of the second maximum value, the third target temperature is the temperature of the second CPU at the target time during the third pressure test, the third test result is the test result obtained by performing the third pressure test, and the third test result includes the target time and the third target temperature in correspondence with each other; and when the target time is in a fourth time interval, determining the fault description information as fourth fault description information according to a fourth test result, wherein the fourth time interval is the time interval in which the fourth pressure test is performed, the fourth fault description information indicates that the second CPU failed at a fourth target temperature with the load of the second CPU at the fourth preset proportion of the second maximum value, the fourth target temperature is the temperature of the second CPU at the target time during the fourth pressure test, the fourth test result is the test result obtained by performing the fourth pressure test, the fourth test result includes the target time and the fourth target temperature in correspondence with each other, and the maximum value of the third time interval is less than or equal to the minimum value of the fourth time interval.
In the above embodiment, the target time at which the CCIX interconnection fault occurred can be obtained from the target log; that is, the target log also records the time at which the CCIX fault occurred during server detection. When the target time is within the first time interval, the fault description information may be determined as first fault description information according to the first test result; for example, the first fault description information indicates that the first CPU failed at the first target temperature (for example, 60 ℃, or another value) with the load of the first CPU at the first preset proportion of the preset first maximum value (for example, a first-CPU load of 50%). Similarly, when the target time is within the second time interval, the fault description information may be determined as second fault description information according to the second test result; for example, the second fault description information indicates that the first CPU failed at the second target temperature (for example, 90 ℃, or another value) with the load of the first CPU at the second preset proportion of the preset first maximum value (for example, a first-CPU load of 100%). When the target time is within the third time interval, the fault description information may be determined as third fault description information according to the third test result; for example, the third fault description information indicates that the second CPU failed at the third target temperature (for example, 70 ℃, or another value) with the load of the second CPU at the third preset proportion of the preset second maximum value (for example, a second-CPU load of 50%). When the target time is within the fourth time interval, the fault description information may be determined as fourth fault description information according to the fourth test result; for example, the fourth fault description information indicates that the second CPU failed at the fourth target temperature (for example, 90 ℃, or another value) with the load of the second CPU at the fourth preset proportion of the preset second maximum value (for example, a second-CPU load of 100%). In practical applications, the fault description information can be fed back to the CPU manufacturer as an important reference for adjusting and optimizing the internal parameters of the CPU, which helps improve the CPU.
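One way to assemble such fault description information from the test results is sketched below. Everything here is illustrative: the dictionary layouts for stages and temperature samples are hypothetical, and using the most recent sample taken at or before the fault time as the "target temperature" is an assumption, since the samples are only taken periodically.

```python
def describe_fault(target_time, stages, samples):
    """Build fault description information for a fault logged at target_time.

    stages:  list of dicts with keys "start", "end", "cpu" ("first" or
             "second") and "load_percent" for the CPU under ramp in that stage.
    samples: list of dicts with keys "time", "first", "second", where the
             last two hold the sampled temperatures of the two CPUs.
    """
    for stage in stages:
        if stage["start"] <= target_time < stage["end"]:
            earlier = [s for s in samples if s["time"] <= target_time]
            if not earlier:
                return None
            # Assumption: the latest sample before the fault approximates
            # the temperature at the target time.
            latest = max(earlier, key=lambda s: s["time"])
            cpu = stage["cpu"]
            return {
                "cpu": cpu,
                "load_percent": stage["load_percent"],
                "temperature_c": latest[cpu],
            }
    return None
```

The returned record carries exactly the three items the embodiment names: which CPU failed, at what load proportion, and at what temperature.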
In an alternative embodiment, before said performing a stress test on a set of CPUs in said server in a stepwise increasing CPU load, said method further comprises: and setting the target parameter in the target register to the target value in response to the acquired setting instruction.
In the above embodiment, before the pressure test is performed on the group of CPUs, the target parameter in the target register is set, the target register being the CCIX retrain setting register of the CPU. In the related art, the CCIX retrain count in the CPU defaults to only 3, and once a retrain operation succeeds no further retrain operations are performed. In the embodiment of the present application, the target value of the CCIX retrain count may be set to 150 (or another value): by changing the register, the number of retrain operations is changed to the target value, and the full number of retrain operations corresponding to the target value (for example, 150) is still completed even when a retrain succeeds, which increases the number of CCIX interconnections. In other words, setting the CPU register causes CCIX retrain to be triggered at high frequency; when retrain operations are triggered between the CPUs, CCIX interconnection faults or problems are exposed more easily, and bad CPUs with CCIX interconnection faults can be screened out more effectively.
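Register settings of this kind are typically supplied to firmware as fixed-width binary values. The sketch below shows how such a value can be encoded, assuming a 32-bit little-endian layout; under that assumption the byte strings used in the example flow of this description, '\x70\x40\x00\x00' and '\xb8\x41\x00\x00', would correspond to the numbers 0x4070 and 0x41B8. The function name encode_u32 is hypothetical, and what these numbers mean to the firmware is not stated in the description.

```python
import struct


def encode_u32(value):
    """Encode an unsigned 32-bit integer as 4 little-endian bytes."""
    return struct.pack("<I", value)


def write_binary_file(path, value):
    """Write the encoded value to a file, for use as a firmware variable data file."""
    with open(path, "wb") as handle:
        handle.write(encode_u32(value))
```

Calling write_binary_file("offset.bin", 0x4070) produces the same 4 bytes as the corresponding printf command, under the little-endian assumption stated above.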
It will be apparent that the embodiments described above are merely some, but not all, embodiments of the application. The present application is described in detail below, taking the detection of CPU CCIX interconnection problems in a two-way ARM server as an example.
The embodiment of the application provides a server system diagnosis method in which a CPU-related register is set so that CCIX retrain between the two-way CPUs (corresponding to the aforementioned group of CPUs) is actively triggered many times with high probability, and a CPU load tool such as stress-ng is then used to increase the CPU load step by step, so as to trigger CPU CCIX interconnection problems and intercept the problematic CPU or server.
Fig. 3 is a flow chart of anomaly detection of a two-way CPU server according to an embodiment of the present application, the flow including:
S302, set the CPU register (corresponding to the aforementioned target register).
By changing the register, the number of active retrain operations can be set to 150 (or another number corresponding to the target value), and all 150 retrain operations are still completed even when a retrain succeeds, which increases the number of CCIX interconnections. The register can be set under the OS as follows:
printf '\x70\x40\x00\x00' > offset.bin
printf '\x01\x00\x00\x00' > value.bin
efivar --write --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Offset --datafile offset.bin
efivar --write --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Value --datafile value.bin
efivar --print --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Offset
printf '\xb8\x41\x00\x00' > offset.bin
printf '\x01\x00\x00\x00' > value.bin
efivar --write --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Offset --datafile offset.bin
efivar --write --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Value --datafile value.bin
efivar --print --name a12544a4-bcc0-4b12-aa56-0a2a76f16563-Offset
S304, reboot the OS and re-enter the system so that the register setting takes effect.
S306, sample and collect operating information in the background.
This step collects and records the temperature of CPU0, the temperature of CPU1 and the CPU utilization throughout the whole process for data analysis. As the number of production samples in manufacturing grows, the data can be used to analyze at what temperature and under what load CCIX problems tend to occur, which is of great significance for tuning the internal parameters of the CPU.
The temperatures of the two CPUs are collected once every 30 seconds: ipmitool sdr list | grep -E "CPU0_Temp|CPU1_Temp";
The CPU load utilization is collected once every 30 seconds: vmstat | awk '{print 100 - $(NF-2)}' | awk 'NR==3'.
S308, run a 15-minute 25% CPU load test and detect anomalies. Step S308 includes:
S30802, use the stress-ng pressure tool to run a core-bound pressure test on 50% of the cores of CPU0 for 60 minutes (so that this load persists through all four 15-minute stages), that is, apply pressure load to 25% of the CPU cores of the whole two-way server so that the CPU load of the whole server reaches 25%.
S30804, then detecting an abnormality as follows:
a) After running for 15 minutes, check the dmesg and messages logs under the OS for the "Hardware Error" keyword;
When the "Hardware Error" keyword appears in the log, this indicates that a CCIX problem has been triggered and that the CPU is likely bad; the bad CPU is intercepted.
b) Check the alarm log in the BMC SEL: use ipmitool sel list to check whether there is any alarm information.
If there is an alarm pointing to the CPU, an error is reported and the CPU is intercepted.
If an abnormality is detected in step S30804, the flow proceeds to step S318; if no abnormality is detected in step S30804, the flow advances to step S310.
S310, run a 15-minute 50% CPU load test and detect anomalies. Step S310 includes:
S31002, use the stress-ng pressure tool to run a core-bound pressure test on the remaining 50% of the cores of CPU0 for 45 minutes; combined with the 25% load from step S308, pressure load is now applied to a cumulative 50% of the CPU cores of the whole two-way server, so that the CPU load of the whole server reaches 50%.
S31004, after running for a further 15 minutes, detect anomalies again, using the same detection method as in step S30804 above.
If an abnormality is detected in step S31004, the flow proceeds to step S318; if no abnormality is detected in step S31004, the routine proceeds to step S312.
S312, run a 15-minute 75% CPU load test and detect anomalies. Step S312 includes:
S31202, use the stress-ng pressure tool to run a core-bound pressure test on 50% of the cores of CPU1 for 30 minutes; combined with the 50% load from the preceding steps, pressure load is now applied to a cumulative 75% of the CPU cores of the whole two-way server, so that the CPU load of the whole server reaches 75%.
S31204, after running for a further 15 minutes, detect anomalies again, using the same detection method as in step S30804 above.
If an abnormality is detected in step S31204, the flow proceeds to step S318; if no abnormality is detected in step S31204, the flow proceeds to step S314.
S314, run a 15-minute 100% CPU load test and detect anomalies. Step S314 includes:
S31402, use the stress-ng pressure tool to run a core-bound pressure test on the remaining 50% of the cores of CPU1 for 15 minutes; combined with the 75% load from the preceding steps, pressure load is now applied to a cumulative 100% of the CPU cores of the whole two-way server, so that the CPU load of the whole server reaches 100%.
S31404, after running for a further 15 minutes, detect anomalies again, using the same detection method as in step S30804 above.
If an abnormality is detected in step S31404, the flow proceeds to step S318; if no abnormality is detected in step S31404, the flow advances to step S316.
S316, stop information sampling and collection and analyze the data; if no abnormality was detected during the stepwise load increase, the production-line abnormality check for this stage is passed.
S318, ending.
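The core-binding commands for the four load stages of the flow above can be sketched as follows. This is an illustration under stated assumptions: the function name stage_commands is hypothetical, cores are assumed to be numbered contiguously (CPU0 first, then CPU1), each CPU is assumed to have an even number of cores, and the stress-ng options --cpu, --taskset and --timeout are used for worker count, core binding and run time. Each command is meant to be started in the background 15 minutes after the previous one.

```python
def stage_commands(cores_per_cpu, stage_minutes=15, total_stages=4):
    """Return one stress-ng command per stage for a two-way server.

    Stage k loads the k-th block of half a CPU's cores and keeps running
    until the end of the last stage, so run times shrink from 60 to 15
    minutes with the default settings.
    """
    if cores_per_cpu % 2 != 0:
        raise ValueError("cores_per_cpu must be even")
    half = cores_per_cpu // 2
    commands = []
    for stage in range(total_stages):
        first_core = stage * half
        last_core = first_core + half - 1
        remaining = (total_stages - stage) * stage_minutes
        commands.append(
            f"stress-ng --cpu {half} --taskset {first_core}-{last_core} "
            f"--timeout {remaining}m"
        )
    return commands
```

For a server with 64 cores per CPU this yields bindings 0-31, 32-63, 64-95 and 96-127 with run times of 60, 45, 30 and 15 minutes, matching the durations given in steps S30802, S31002, S31202 and S31402.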
The load and sampled-information curves over the whole process are shown in Fig. 4, which is a graph of CPU utilization and temperature change according to an embodiment of the present application. Fig. 4 includes the temperature curve of CPU0, the temperature curve of CPU1, and the CPU utilization curve of the whole server. In Fig. 4, the horizontal axis (abscissa) represents time (in seconds) and the vertical axis (ordinate) represents temperature or utilization: for the temperature curves of CPU0 and CPU1 the vertical axis represents temperature, while for the server's CPU utilization curve it represents utilization, for example a value of 20 means 20% utilization and a value of 50 means 50% utilization.
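Each point on the utilization curve is derived from one vmstat sample as 100 minus the idle percentage. A minimal sketch of that calculation is shown below; the function name is hypothetical, and it assumes the classic vmstat layout with two header lines and a data line whose last five columns are us, sy, id, wa and st (other vmstat versions may print additional columns, in which case the column index would differ).

```python
def cpu_usage_from_vmstat(output):
    """Compute CPU utilization (percent) from the text printed by vmstat.

    Assumption: line 3 is the data line and the third column from the
    end is the idle percentage.
    """
    lines = output.strip().splitlines()
    fields = lines[2].split()
    idle = int(fields[-3])
    return 100 - idle


# Hypothetical sample output, for illustration only.
SAMPLE = """procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 123456   7890 456789    0    0     1     2   30   40 20  5 75  0  0
"""
```

For the sample above the idle column is 75, so the computed utilization is 25, which corresponds to the first load stage.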
In the above embodiment, a method for diagnosing CPU CCIX in system diagnosis is provided: a CPU register is preset so that CCIX retrain is actively triggered with high probability, and a CPU load tool is then used to increase the load of each CPU step by step, so that CPU CCIX problems are exposed quickly and problematic CPUs are effectively intercepted. By collecting a large number of samples and analyzing the sample data, a reference basis is provided for CPU manufacturers to tune the internal parameter settings of the CPU.
According to the embodiment of the application, the CPU register is set so that CCIX retrain is triggered at high frequency, and the load of each of the two CPUs is increased in gradient steps, so that CPU CCIX problems are exposed quickly and bad CPUs with CCIX problems can be screened out effectively. This prevents servers with CPU CCIX problems from reaching the market and improves shipment quality. In addition, during batch production at the server assembly plant, the CPU load, the CPU0 temperature and the CPU1 temperature are sampled centrally and big-data analysis is performed on the collected data, which makes it possible to determine under what load and at what temperature the CPU is prone to CCIX problems. This provides an important reference for CPU manufacturers in tuning the internal parameters of the CPU and helps improve the CPU.
The technical solution of the embodiments of the present application can also be applied to system diagnosis, and even CPU inspection, for CPUs from Intel, AMD and others that use the CCIX scheme.
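The kind of analysis described here can be as simple as counting faults per load level and temperature range. The sketch below is illustrative only: fault_histogram is a hypothetical name, the input is assumed to be one (load percent, temperature in ℃) pair per detected fault, and the 10-degree bucket width is an arbitrary choice.

```python
from collections import Counter


def fault_histogram(records, bucket_width=10):
    """Count faults per (load percent, temperature bucket start).

    records: iterable of (load_percent, temperature_c) pairs, one per fault.
    """
    counts = Counter()
    for load_percent, temperature_c in records:
        bucket_start = int(temperature_c // bucket_width) * bucket_width
        counts[(load_percent, bucket_start)] += 1
    return counts
```

The most frequent key in the result indicates the load and temperature range in which CCIX problems were observed most often across the sampled servers.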
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the embodiments of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
In this embodiment, there is also provided a detection apparatus for a server, and fig. 5 is a structural diagram of the detection apparatus for a server according to an embodiment of the present application, as shown in fig. 5, where the apparatus includes:
The execution module 502 is configured to perform a pressure test on a group of CPUs in a server by increasing the CPU load step by step when a target parameter in a target register in the server is set to a target value, where the target value is used to control, during the pressure test, the number of interconnections made between different CPUs in the group of CPUs based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol to be greater than a preset default value;
The obtaining module 504 is configured to obtain a target log generated by executing the pressure test, where the target log is configured to record log information corresponding to a CCIX interconnection fault when the CCIX interconnection fault occurs in a process of executing the pressure test;
The first determining module 506 is configured to determine, according to the target log, whether the CCIX interconnection fault occurs between different CPUs in the group of CPUs.
In an alternative embodiment, the executing module 502 includes: a first execution unit, configured to execute, in a case where a load of a second CPU is kept zero, the pressure test on the first CPU and the second CPU in a manner of increasing the load of the first CPU stepwise until the load of the first CPU is a preset first maximum value, where the set of CPUs includes the first CPU and the second CPU, and the first CPU and the second CPU are interconnected based on the CCIX protocol; and/or a second execution unit configured to execute the pressure test on the first CPU and the second CPU in such a manner that the load of the second CPU is increased stepwise until the load of the second CPU is a preset second maximum value, in a case where the load of the first CPU is maintained at the preset first maximum value.
In an alternative embodiment, the first determining module 506 includes at least one of: a first determining unit, configured to determine that the CCIX interconnection fault occurs between at least some CPUs in the set of CPUs in the case that a target field is included in the target log, where the target field is used to indicate that a hardware fault corresponding to the CCIX interconnection fault occurs in the server; and the second determining unit is used for determining that the CCIX interconnection fault occurs between at least part of the CPUs in the group of CPUs under the condition that the target log comprises target alarm information, wherein the target alarm information is used for indicating a CPU alarm corresponding to the CCIX interconnection fault.
In an alternative embodiment, the executing module 502 includes: a third execution unit configured to execute a first pressure test on a first CPU and a second CPU in a case where a load of the first CPU is set to a first preset ratio of a preset first maximum value and a load of the second CPU is set to zero, wherein the set of CPUs includes the first CPU and the second CPU, and interconnection is performed between the first CPU and the second CPU based on the CCIX protocol; a fourth execution unit configured to execute a second pressure test on the first CPU and the second CPU, in a case where a load of the first CPU is set to a second preset ratio of the first maximum value and a load of the second CPU is set to zero; a fifth execution unit configured to execute a third pressure test on the first CPU and the second CPU, in a case where a load of the first CPU is set to the second preset proportion of the first maximum value and a load of the second CPU is set to a third preset proportion of a preset second maximum value; a sixth execution unit configured to execute a fourth pressure test on the first CPU and the second CPU, in a case where a load of the first CPU is set to the second preset proportion of the first maximum value and a load of the second CPU is set to a fourth preset proportion of the second maximum value; the second preset proportion is larger than the first preset proportion, the fourth preset proportion is larger than the third preset proportion, and the pressure test comprises the first pressure test, the second pressure test, the third pressure test and the fourth pressure test.
In an alternative embodiment, the apparatus further comprises: a second determining module, configured to determine, according to the target log, that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault when it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU.
In an alternative embodiment, the second determining module includes: a first acquisition unit, configured to acquire the target time of the CCIX interconnection fault recorded in the target log; a third determining unit, configured to determine that the first CPU has a fault associated with the CCIX interconnection fault when the target time is located in a first time interval or a second time interval, where the first time interval is the time interval in which the first pressure test is performed, the second time interval is the time interval in which the second pressure test is performed, and a maximum value of the first time interval is less than or equal to a minimum value of the second time interval; a fourth determining unit, configured to determine that the second CPU has a fault associated with the CCIX interconnection fault when the target time is located in a third time interval or a fourth time interval, where the third time interval is the time interval in which the third pressure test is performed, the fourth time interval is the time interval in which the fourth pressure test is performed, and a maximum value of the third time interval is less than or equal to a minimum value of the fourth time interval.
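The interval-based attribution above can be sketched as below. The function name `attribute_fault` is hypothetical, and treating each interval as half-open (start included, end excluded) is an assumption made so that a time on a shared boundary falls into exactly one interval; the source does not specify boundary handling.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) of one pressure test, in seconds


def attribute_fault(target_time: float, intervals: Sequence[Interval]) -> Optional[str]:
    """Map the fault time to the CPU whose load was being raised at that time.

    `intervals` holds the four test intervals in execution order.
    """
    if len(intervals) != 4:
        raise ValueError("expected exactly four test intervals")
    for index, (start, end) in enumerate(intervals):
        if start <= target_time < end:  # half-open interval: an assumption
            # First and second tests stress the first CPU; third and fourth, the second.
            return "first CPU" if index < 2 else "second CPU"
    return None  # fault time lies outside every test interval
```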
In an alternative embodiment, the apparatus further comprises: a third determining module, configured to determine, according to the target log, fault description information when it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU, where the fault description information is used to describe that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
In an alternative embodiment, the third determining module includes: a second acquisition unit, configured to acquire the target time of the CCIX interconnection fault recorded in the target log; a fifth determining unit, configured to determine, when the target time is located in a first time interval, the fault description information as first fault description information according to a first test result, where the first time interval is the time interval in which the first pressure test is performed, the first fault description information is used to indicate that the first CPU has the fault at a first target temperature with the load of the first CPU at the first preset proportion of the first maximum value, the first target temperature is the temperature of the first CPU at the target time during the first pressure test, the first test result is the test result obtained by performing the first pressure test, and the first test result includes the target time and the first target temperature in correspondence; a sixth determining unit, configured to determine, when the target time is located in a second time interval, the fault description information as second fault description information according to a second test result, where the second time interval is the time interval in which the second pressure test is performed, the second fault description information is used to indicate that the first CPU has the fault at a second target temperature with the load of the first CPU at the second preset proportion of the first maximum value, the second target temperature is the temperature of the first CPU at the target time during the second pressure test, the second test result is the test result obtained by performing the second pressure test, the second test result includes the target time and the second target temperature in correspondence, and a maximum value of the first time interval is less than or equal to a minimum value of the second time interval; a seventh determining unit, configured to determine, when the target time is located in a third time interval, the fault description information as third fault description information according to a third test result, where the third time interval is the time interval in which the third pressure test is performed, the third fault description information is used to indicate that the second CPU has the fault at a third target temperature with the load of the second CPU at the third preset proportion of the second maximum value, the third target temperature is the temperature of the second CPU at the target time during the third pressure test, the third test result is the test result obtained by performing the third pressure test, and the third test result includes the target time and the third target temperature in correspondence; an eighth determining unit, configured to determine, when the target time is located in a fourth time interval, the fault description information as fourth fault description information according to a fourth test result, where the fourth time interval is the time interval in which the fourth pressure test is performed, the fourth fault description information is used to indicate that the second CPU has the fault at a fourth target temperature with the load of the second CPU at the fourth preset proportion of the second maximum value, the fourth target temperature is the temperature of the second CPU at the target time during the fourth pressure test, the fourth test result is the test result obtained by performing the fourth pressure test, the fourth test result includes the target time and the fourth target temperature in correspondence, and a maximum value of the third time interval is less than or equal to a minimum value of the fourth time interval.
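A minimal sketch of how fault description information might be assembled from a test result follows. The data layout (one tuple per test stage, and one time-to-temperature mapping per test result), the function name `describe_fault`, the half-open interval handling, and the wording of the returned string are all assumptions made for illustration; the source only states that each test result associates the target time with the corresponding target temperature.

```python
from typing import Dict, Optional, Sequence, Tuple

# (start, end, CPU under test, load as a fraction of that CPU's preset maximum)
StageInfo = Tuple[float, float, str, float]


def describe_fault(
    target_time: float,
    stages: Sequence[StageInfo],
    results: Sequence[Dict[float, float]],
) -> Optional[str]:
    """Build a fault description from the stage that contains the fault time.

    results[i] maps sample time to CPU temperature for the i-th pressure test.
    A KeyError is raised if no temperature was sampled at the fault time.
    """
    for (start, end, cpu, load), temps in zip(stages, results):
        if start <= target_time < end:  # half-open interval: an assumption
            temperature = temps[target_time]
            return (
                f"{cpu} fault at {temperature:.1f} degrees C "
                f"with load at {load:.0%} of its maximum"
            )
    return None  # fault time lies outside every test interval
```

A description of this kind records both the temperature and the load step at which the fault appeared, which is what the four fault description variants above distinguish.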
In an alternative embodiment, the apparatus further comprises: a setting module, configured to set, in response to an acquired setting instruction, the target parameter in the target register to the target value before the pressure test is performed on the set of CPUs in the server by increasing the CPU load step by step.
It should be noted that each of the above units or modules may be implemented by software or hardware, and for the latter, may be implemented by, but not limited to: the units or modules are all located in the same processor; alternatively, each of the units or modules described above may be located in a different processor in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples of this embodiment, refer to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
Those skilled in the art will appreciate that the modules or steps of the embodiments described above may be implemented on general-purpose computing devices, either concentrated on a single computing device or distributed across a network of computing devices. They may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that given here; alternatively, the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the embodiments of the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the embodiments of the present application should be included in the protection scope of the embodiments of the present application.

Claims (11)

1. A method for detecting a server, comprising:
in the case where a target parameter in a target register in a server is set to a target value, performing a pressure test on a set of CPUs in the server by increasing CPU load step by step, wherein the target value is used for controlling the number of interconnections performed between different CPUs in the set of CPUs based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol to be greater than a preset default value during the pressure test;
obtaining a target log generated by executing the pressure test, wherein the target log is used for recording log information corresponding to a CCIX interconnection fault under the condition that the CCIX interconnection fault occurs in the process of executing the pressure test;
and determining whether the CCIX interconnection fault occurs between different CPUs in the set of CPUs according to the target log.
2. The method of claim 1, wherein performing the pressure test on the set of CPUs in the server by increasing CPU load step by step comprises:
executing the pressure test on a first CPU and a second CPU in a manner of gradually increasing the load of the first CPU until the load of the first CPU is a preset first maximum value under the condition that the load of the second CPU is kept to be zero, wherein the group of CPUs comprises the first CPU and the second CPU, and the first CPU and the second CPU are interconnected based on the CCIX protocol; and/or
executing the pressure test on the first CPU and the second CPU in a manner of gradually increasing the load of the second CPU until the load of the second CPU is a preset second maximum value under the condition that the load of the first CPU is kept to be the preset first maximum value.
3. The method of claim 1, wherein said determining from said target log whether said CCIX interconnection fault occurred between different CPUs in said set of CPUs comprises at least one of:
determining that the CCIX interconnection fault occurs between at least some CPUs in the set of CPUs in the case that a target field is included in the target log, wherein the target field is used for indicating that a hardware fault corresponding to the CCIX interconnection fault occurs in the server;
determining that the CCIX interconnection fault occurs between at least some CPUs in the set of CPUs in the case that target alarm information is included in the target log, wherein the target alarm information is used for indicating a CPU alarm corresponding to the CCIX interconnection fault.
4. The method of claim 1, wherein performing the pressure test on the set of CPUs in the server by increasing CPU load step by step comprises:
performing a first pressure test on a first CPU and a second CPU with a load of the first CPU set to a first preset proportion of a preset first maximum value and a load of the second CPU set to zero, wherein the set of CPUs includes the first CPU and the second CPU, and the first CPU and the second CPU are interconnected based on the CCIX protocol;
performing a second pressure test on the first CPU and the second CPU with the load of the first CPU set to a second preset proportion of the first maximum value and the load of the second CPU set to zero;
performing a third pressure test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a third preset proportion of a preset second maximum value;
performing a fourth pressure test on the first CPU and the second CPU with the load of the first CPU set to the second preset proportion of the first maximum value and the load of the second CPU set to a fourth preset proportion of the second maximum value;
the second preset proportion is larger than the first preset proportion, the fourth preset proportion is larger than the third preset proportion, and the pressure test comprises the first pressure test, the second pressure test, the third pressure test and the fourth pressure test.
5. The method according to claim 4, wherein the method further comprises:
in the case where it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU, determining, according to the target log, that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
6. The method according to claim 4, wherein the method further comprises:
in the case where it is determined that the CCIX interconnection fault occurs between the first CPU and the second CPU, determining fault description information according to the target log, wherein the fault description information is used for describing that at least one of the first CPU and the second CPU has a fault associated with the CCIX interconnection fault.
7. The method of claim 6, wherein said determining fault description information from said target log comprises:
acquiring a target time recorded in the target log, wherein the target time is the time, recorded in the target log, at which the CCIX interconnection fault occurred;
in the case where the target time is within a first time interval, determining the fault description information as first fault description information according to a first test result, wherein the first time interval is the time interval in which the first pressure test is performed, the first fault description information is used for indicating that the first CPU has the fault at a first target temperature with the load of the first CPU at the first preset proportion of the first maximum value, the first target temperature is the temperature of the first CPU at the target time during the first pressure test, the first test result is the test result obtained by performing the first pressure test, and the first test result includes the target time and the first target temperature in correspondence;
in the case where the target time is within a second time interval, determining the fault description information as second fault description information according to a second test result, wherein the second time interval is the time interval in which the second pressure test is performed, the second fault description information is used for indicating that the first CPU has the fault at a second target temperature with the load of the first CPU at the second preset proportion of the first maximum value, the second target temperature is the temperature of the first CPU at the target time during the second pressure test, the second test result is the test result obtained by performing the second pressure test, the second test result includes the target time and the second target temperature in correspondence, and a maximum value of the first time interval is less than or equal to a minimum value of the second time interval;
in the case where the target time is within a third time interval, determining the fault description information as third fault description information according to a third test result, wherein the third time interval is the time interval in which the third pressure test is performed, the third fault description information is used for indicating that the second CPU has the fault at a third target temperature with the load of the second CPU at the third preset proportion of the second maximum value, the third target temperature is the temperature of the second CPU at the target time during the third pressure test, the third test result is the test result obtained by performing the third pressure test, and the third test result includes the target time and the third target temperature in correspondence;
and in the case where the target time is within a fourth time interval, determining the fault description information as fourth fault description information according to a fourth test result, wherein the fourth time interval is the time interval in which the fourth pressure test is performed, the fourth fault description information is used for indicating that the second CPU has the fault at a fourth target temperature with the load of the second CPU at the fourth preset proportion of the second maximum value, the fourth target temperature is the temperature of the second CPU at the target time during the fourth pressure test, the fourth test result is the test result obtained by performing the fourth pressure test, the fourth test result includes the target time and the fourth target temperature in correspondence, and a maximum value of the third time interval is less than or equal to a minimum value of the fourth time interval.
8. The method of any of claims 1 to 7, wherein prior to performing the pressure test on the set of CPUs in the server by increasing CPU load step by step, the method further comprises:
and setting the target parameter in the target register to the target value in response to the acquired setting instruction.
9. A detection apparatus for a server, comprising:
an execution module, configured to perform a pressure test on a set of CPUs in a server by increasing CPU load step by step in the case where a target parameter in a target register in the server is set to a target value, wherein the target value is used for controlling the number of interconnections performed between different CPUs in the set of CPUs based on the Cache Coherent Interconnect for Accelerators (CCIX) protocol to be greater than a preset default value during the pressure test;
the acquisition module is used for acquiring a target log generated by executing the pressure test, wherein the target log is used for recording log information corresponding to the CCIX interconnection fault under the condition that the CCIX interconnection fault occurs in the process of executing the pressure test;
and the first determining module is used for determining whether the CCIX interconnection fault occurs between different CPUs in the group of CPUs according to the target log.
10. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method according to any of the claims 1 to 8.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
CN202310778073.2A 2023-06-28 2023-06-28 Method and device for detecting server, storage medium and electronic device Pending CN116795648A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778073.2A CN116795648A (en) 2023-06-28 2023-06-28 Method and device for detecting server, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310778073.2A CN116795648A (en) 2023-06-28 2023-06-28 Method and device for detecting server, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN116795648A true CN116795648A (en) 2023-09-22

Family

ID=88034254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310778073.2A Pending CN116795648A (en) 2023-06-28 2023-06-28 Method and device for detecting server, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN116795648A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112370A (en) * 2023-10-24 2023-11-24 四川华鲲振宇智能科技有限责任公司 Method and system for acquiring power consumption of server



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination