CN115827355B - Method and device for detecting abnormal core in graphics processor and electronic equipment - Google Patents

Info

Publication number
CN115827355B
CN115827355B CN202310030933.4A CN202310030933A CN115827355B CN 115827355 B CN115827355 B CN 115827355B CN 202310030933 A CN202310030933 A CN 202310030933A CN 115827355 B CN115827355 B CN 115827355B
Authority
CN
China
Prior art keywords
hardware processing
processing unit
test case
case data
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310030933.4A
Other languages
Chinese (zh)
Other versions
CN115827355A (en)
Inventor
江靖华
梁存旭
张坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Original Assignee
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority to CN202310030933.4A
Publication of CN115827355A
Application granted
Publication of CN115827355B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the present application disclose a method and a device for detecting an abnormal core in a graphics processor, and an electronic device. The method effectively avoids blocking caused by an abnormal hardware processing unit while the graphics processor is running, and promptly returns the hardware processing unit to the working state once it recovers, thereby improving the task processing efficiency of the graphics processor. The detection method comprises: initializing the graphics processor when the graphics processor is powered on; if the initialization is normal, controlling the main control core to use test case data to periodically detect whether a hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit; if the hardware processing unit is running abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit; and when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume task dispatch to the hardware processing unit.

Description

Method and device for detecting abnormal core in graphics processor and electronic equipment
Technical Field
The present application relates to the field of microelectronics technology, and in particular to a method and a device for detecting an abnormal core in a graphics processor, and an electronic device.
Background
In the field of graphics processing, graphics processing units (GPUs) handle huge amounts of data, and their data processing capability keeps increasing. GPUs are currently tested with automatic test equipment (ATE) at packaging and testing plants, where they go through a series of test stages such as functional testing and reliability testing.
In most of the current defect inspection stages, the GPU is screened mainly by perspective inspection and appearance inspection, and the functional characteristics of the GPU are not tested. A GPU that has passed these tests may still become abnormal because of its operating environment or software problems. When a hardware processing unit becomes abnormal while the GPU is running, the current task cannot be completed and a blockage forms at that hardware processing unit (that is, an abnormal core).
Disclosure of Invention
In view of the above problem, the present application provides a method and a device for detecting an abnormal core in a graphics processor, and an electronic device. They effectively avoid blocking caused by an abnormal hardware processing unit while the graphics processor is running, and promptly return the hardware processing unit to the working state once it recovers, thereby improving the task processing efficiency of the graphics processor. It is easy to understand that a hardware processing unit in which an abnormality occurs is an abnormal core.
In a first aspect, the present application provides a method for detecting an abnormal core in a graphics processor, where the graphics processor includes a main control core and a hardware processing unit, and the method includes:
initializing the graphics processor when the graphics processor is powered on;
if the initialization is normal, controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit;
if the hardware processing unit is running abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit;
and when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume task dispatch to the hardware processing unit.
Optionally, in some implementations of the first aspect, an abnormality flag is set for the hardware processing unit; the detection method further comprises:
when the hardware processing unit runs abnormally, controlling the main control core to mark the abnormality flag of the hardware processing unit as abnormal;
when the hardware processing unit returns to normal operation, controlling the main control core to mark the abnormality flag of the hardware processing unit as normal.
Optionally, in some implementations of the first aspect, the test case data includes overall test case data and individual test case data. The overall test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units. The individual test case data is designed for a specific type of hardware processing unit and can test only that type.
Optionally, in some implementations of the first aspect, the graphics processor further includes a dispatcher, a decision maker and a result comparator; controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally comprises:
controlling the main control core to dispatch the test case data to the dispatcher with priority;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send a request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
Optionally, in some implementations of the first aspect, controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result includes:
if the comparison result is equal, determining that the hardware processing unit is running normally;
if the comparison result is unequal, determining that the hardware processing unit is running abnormally.
Optionally, in some implementations of the first aspect, before controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, the method further includes:
controlling the main control core to send a data request signal to the host processor (CPU) to obtain the test case data, wherein the test case data is stored on the CPU side and can be called dynamically;
or controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware and the detection firmware is loaded into the memory of the graphics processor.
Optionally, in some implementations of the first aspect, when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading, and Flash power-on loading.
In a second aspect, the present application provides a device for detecting an abnormal core in a graphics processor, where the graphics processor includes a main control core and a hardware processing unit; the detection device comprises:
an initialization module, an abnormality detection module and a task dispatch module;
the initialization module is used for: initializing the graphics processor when the graphics processor is powered on;
the abnormality detection module is used for: if the initialization is normal, controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit;
the task dispatch module is used for: if the hardware processing unit is running abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit;
the task dispatch module is further used for: when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume task dispatch to the hardware processing unit.
In a third aspect, the present application provides an electronic device, including:
a memory and a processor, wherein the memory has executable code stored thereon;
when the executable code is invoked by the processor, the electronic device is caused to perform the steps of the method for detecting an abnormal core in a graphics processor according to the first aspect or any of its implementations.
In a fourth aspect, the present application provides a computer readable storage medium having executable code stored thereon, which, when invoked by a processor of an electronic device, causes the electronic device to perform the steps of the method for detecting an abnormal core in a graphics processor according to the first aspect or any of its implementations.
The technical solution provided by the present application has the following beneficial effects:
In this technical solution, the main control core is controlled to use test case data to periodically detect whether a hardware processing unit in the graphics processor is running abnormally. If the hardware processing unit is abnormal, the main control core is controlled to suspend task dispatch to it, so that the abnormality is discovered promptly and no further tasks are sent to the abnormal unit. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, task dispatch to it is resumed. This effectively avoids blocking caused by an abnormal hardware processing unit while the graphics processor is running, and promptly returns the unit to the working state once it recovers, thereby improving the task processing efficiency of the graphics processor.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a flow chart of a method for detecting an abnormal core in a graphics processor according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a device for detecting an abnormal core in a graphics processor according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
To facilitate understanding of the technical solution in the embodiments of the present application, the application scenario is described first, as follows:
FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present application.
As shown in FIG. 1, the application scenario includes a host (HOST) and a PCI device. The host includes a processor (CPU) and memory (DDR), with a data transmission channel established between the CPU and the DDR. The PCI device includes a graphics processor (GPU) and video memory (GDDR); likewise, a data transmission channel is established between the GPU and the GDDR. The GPU and the CPU are connected through a PCI bus.
In this application scenario, after the GPU and the CPU are powered on, the BIOS of the HOST scans the PCI device and matches it with a suitable driver. After a successful match, the HOST and the PCI device can communicate over the PCI protocol. The GPU provides hardware information such as device resources and status to the HOST, and the HOST CPU operates and adjusts the working state of the GPU device according to this hardware information and loads the necessary firmware and hardware acceleration tasks for the GPU.
When one or more hardware processing units in the GPU enter an inoperable state, the tasks assigned to them are blocked from completing and the units cannot take part in subsequent tasks, which reduces the overall computing capability of the GPU. Restarting the GPU in this situation is undesirable. A mechanism that dynamically detects whether the hardware processing units are operating normally is therefore needed, so that the GPU keeps running without being restarted.
In view of the above technical problem, the embodiments of the present application provide a method for detecting an abnormal core in a graphics processor.
FIG. 2 is a flow chart of a method for detecting an abnormal core in a graphics processor according to an embodiment of the present application.
As shown in FIG. 2, a method for detecting an abnormal core in a graphics processor according to an embodiment of the present application includes:
201. When the graphics processor is powered on, initialize the graphics processor.
In the embodiment of the present application, when the graphics processor is powered on, the GPU, all hardware processing units, and the related hardware devices are initialized. After initialization is complete, the GPU updates its state for the HOST, and the HOST accesses the GPU state through the PCI bus. A successfully initialized GPU updates the GPU state to ACTIVE (1) in the data structure describing the GPU state; if initialization fails, the state is UNACTIVE (0).
Optionally, the graphics processor further includes a timer that sets the period length for periodically detecting the hardware processing units in the graphics processor. The timer is initialized when the graphics processor is powered on. When periodic detection starts in step 202, the timer is reset to zero and restarts, and is used exclusively for timing the periodic detection.
202. If the initialization is normal, control the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit.
In the embodiment of the present application, initializing the graphics processor mainly means initializing the main control core in the graphics processor. Normal initialization indicates that the main control core is normal; abnormal initialization indicates that the main control core has a problem, that is, an abnormality of the main control core has been detected.
The test case data is functional data capable of detecting the running state of the hardware processing unit. The test case data includes overall test case data and individual test case data. The overall test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units. The individual test case data is designed for a specific type of hardware processing unit and can test only that type. For example, the individual test case data includes class C test case data, class B test case data, and class A test case data.
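The distinction between overall and individual test case data can be illustrated with a simple filter. This is only a minimal sketch, not the implementation of the embodiment: the `CaseData` record, its field names, and the unit type strings are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CaseData:
    """Hypothetical record for one piece of test case data."""
    name: str
    unit_type: Optional[str] = None  # None means overall: applies to every unit type

    def applies_to(self, unit_type: str) -> bool:
        # Overall data tests all types; individual data tests only its own type.
        return self.unit_type is None or self.unit_type == unit_type


def select_cases(cases: List[CaseData], unit_type: str) -> List[CaseData]:
    """Return the test case data usable on a unit of the given type."""
    return [c for c in cases if c.applies_to(unit_type)]


cases = [
    CaseData("overall"),
    CaseData("class_a", unit_type="vertex_shader"),
    CaseData("class_b", unit_type="geometry_shader"),
]
print([c.name for c in select_cases(cases, "vertex_shader")])
```

A unit type with no dedicated individual data, such as a fragment shader in this toy list, is still covered by the overall data.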
Further, test cases may exist in forms including, but not limited to: 1. a static form, packaged in the detection firmware; 2. a dynamic form, stored on the HOST side and called at any time. Static test cases can be run quickly at start-up or at regular intervals. Dynamic test cases are called at irregular times: when the GPU meets the condition for detecting the hardware processing units, the dynamic test cases on the CPU side are called and executed.
Dynamic test cases come in many types, such as test cases for the vertex shader, the tessellation shader, the geometry shader, and so on. Each can detect a particular hardware processing unit. They are stored together on the HOST side, and the relevant test cases are called dynamically according to test requirements. When the test cases stored on the HOST side prove insufficient to detect the hardware processing unit during detection, test cases are dynamically selected for dispatch according to feedback, or new test cases are dynamically compiled and generated.
Dynamic test cases include overall tests and individual tests. The overall test case is the same as the test case packaged in the firmware and can detect all hardware processing units. Individual test cases are developed for each type of hardware processing unit and can be called dynamically. Dynamic test cases support, among other approaches, hierarchical testing: a unified detection is performed first, and detecting an abnormality at this stage constitutes the first layer. Detection then proceeds according to the type of the abnormal hardware processing unit and the flagged hardware processing units, which avoids the conflict of retesting other, normal hardware processing units. Testing may also be performed for a specific hardware processing unit type.
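The hierarchical testing described above can be sketched as two passes. The sketch below rests on one interpretation that the description does not state explicitly, namely that the second layer confirms or clears the units flagged by the first layer. The function name, the callback signatures, and the sample units are all hypothetical.

```python
from typing import Callable, Dict, List


def hierarchical_test(
    units: Dict[str, str],
    overall_test: Callable[[str], bool],
    individual_test: Callable[[str, str], bool],
) -> List[str]:
    """Return the names of units still failing after both layers.

    units maps a unit name to its type, e.g. {"vs0": "vertex_shader"}.
    Each test callback returns True when the unit passes.
    """
    # Layer 1: unified detection across all units.
    suspects = [name for name in units if not overall_test(name)]
    # Layer 2: type-specific test, run only on the flagged units so that
    # normal units are not retested.
    return [name for name in suspects if not individual_test(name, units[name])]


units = {"vs0": "vertex_shader", "vs1": "vertex_shader", "gs0": "geometry_shader"}
retested: List[str] = []


def overall(name: str) -> bool:
    return name != "vs1"          # simulated: only vs1 fails the unified test


def individual(name: str, unit_type: str) -> bool:
    retested.append(name)         # record which units reach layer 2
    return False                  # simulated: the type-specific test also fails


abnormal = hierarchical_test(units, overall, individual)
print(abnormal, retested)
```

In this simulated run only the flagged unit reaches the second layer, which is the point of the layered approach.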
Optionally, test case data in the dynamic form, which is stored on the HOST side and called at any time, is called as follows: the main control core is controlled to send a data request signal to the host processor (CPU) to obtain the test case data, wherein the test case data is stored on the CPU side and can be called dynamically.
Optionally, test case data in the static form, which is packaged in the detection firmware, is called as follows: the main control core is controlled to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware and the detection firmware is loaded into the memory of the graphics processor.
Further optionally, when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading, and Flash power-on loading. The three ways of loading the detection firmware are described in detail below.
In the present embodiment, the hardware processing units include, but are not limited to, vertex shaders, tessellation shaders, geometry shaders, fragment shaders, and the like.
In this embodiment of the present application, controlling the main control core to use test case data to periodically detect whether a hardware processing unit in the graphics processor is running abnormally may specifically include the following operations:
controlling the main control core to dispatch the test case data to the dispatcher with priority;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send a request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
It should be noted that the request task index table reflects the states of all hardware processing units, and both the dispatcher and the decision maker can access it. The request task index table serves as the basis for task dispatch by the main control core; because the main control core dispatches tasks according to this table, dispatching tasks to abnormal hardware processing units is effectively avoided.
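The flow from dispatcher to hardware processing unit, result comparator, and decision maker can be modelled in a few lines. This is a minimal software sketch of what the description presents as hardware blocks. The function names, the dictionary used as the request task index table, and the two simulated units are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def detect_once(
    units: Dict[str, Callable[[str], int]],
    request: str,
    correct_result: int,
    index_table: Dict[str, bool],
) -> None:
    """Run one detection pass and update index_table in place (True = normal)."""
    for name, unit in units.items():
        actual = unit(request)              # hardware processing unit runs the request task
        equal = actual == correct_result    # result comparator
        index_table[name] = equal           # decision maker updates the table


def dispatch_targets(index_table: Dict[str, bool]) -> List[str]:
    """Units to which ordinary tasks may be dispatched, according to the table."""
    return [name for name, normal in index_table.items() if normal]


units = {
    "unit_a": lambda req: 3,   # simulated healthy unit: right answer for "1+2"
    "unit_b": lambda req: 7,   # simulated faulty unit: wrong answer
}
table: Dict[str, bool] = {}
detect_once(units, "1+2", 3, table)
print(dispatch_targets(table))
```

Keeping the verdicts in a single table that the dispatch step reads is what lets task dispatch skip abnormal units without any further checks.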
Further optionally, controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result includes:
if the comparison result is equal, determining that the hardware processing unit is running normally;
if the comparison result is unequal, determining that the hardware processing unit is running abnormally.
Abnormal operation of a hardware processing unit covers the following two cases: 1. the hardware processing unit executes the task, but the task processing result obtained after processing is incorrect; 2. the task times out because the hardware processing unit does not process it, and the comparison result is therefore unequal.
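The two abnormal cases can be told apart with a small classifier. The sketch below is illustrative only: the `Verdict` enumeration and the convention that a missing result (`None`) stands for a timeout are assumptions, since the description does not say how a timeout is represented.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    NORMAL = "normal"
    WRONG_RESULT = "wrong result"   # case 1: task processed, result incorrect
    TIMEOUT = "timeout"             # case 2: no result before the deadline


def classify(correct: int, actual: Optional[int]) -> Verdict:
    """actual is None when the unit produced nothing before the timeout."""
    if actual is None:
        return Verdict.TIMEOUT
    return Verdict.NORMAL if actual == correct else Verdict.WRONG_RESULT


def is_abnormal(verdict: Verdict) -> bool:
    # Both abnormal cases count as an unequal comparison result.
    return verdict is not Verdict.NORMAL


print(classify(3, 3), classify(3, 4), classify(3, None))
```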
203. If the hardware processing unit is running abnormally, control the main control core to suspend task dispatch to the hardware processing unit.
In the embodiment of the present application, task dispatch to the abnormal hardware processing unit is stopped. This avoids the situation in which tasks cannot be executed in the abnormal hardware processing unit, pile up, and degrade the overall performance of the graphics processor.
Optionally, in the detection method of the embodiment of the present application, a corresponding abnormality flag is set in advance for each hardware processing unit. For example, an abnormality flag of 1 indicates that the hardware processing unit is abnormal, and an abnormality flag of 0 indicates that the hardware processing unit is normal. The detection method further comprises:
when the hardware processing unit runs abnormally, controlling the main control core to mark the abnormality flag of the hardware processing unit as abnormal, for example by setting the abnormality flag of the hardware processing unit to 1;
when the hardware processing unit returns to normal operation, controlling the main control core to mark the abnormality flag of the hardware processing unit as normal, for example by setting the abnormality flag of the hardware processing unit to 0.
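One possible representation of the per-unit abnormality flags is a bit field with one bit per hardware processing unit. The description specifies only the values 1 (abnormal) and 0 (normal); packing the flags into a single integer, and the function names below, are assumptions made for this sketch.

```python
def mark_abnormal(flags: int, unit_index: int) -> int:
    """Set the unit's abnormality flag to 1."""
    return flags | (1 << unit_index)


def mark_normal(flags: int, unit_index: int) -> int:
    """Clear the unit's abnormality flag back to 0."""
    return flags & ~(1 << unit_index)


def flag_is_set(flags: int, unit_index: int) -> bool:
    """True when the unit is currently marked abnormal."""
    return bool((flags >> unit_index) & 1)


flags = 0                        # all units normal
flags = mark_abnormal(flags, 2)  # unit 2 fails a detection period
print(flag_is_set(flags, 2), flag_is_set(flags, 0))
flags = mark_normal(flags, 2)    # unit 2 passes a later detection period
print(flags)
```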
204. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, control the main control core to resume task dispatch to the hardware processing unit.
In the embodiment of the present application, when the abnormal hardware processing unit is detected to be running normally in a subsequent detection period, the main control core resumes task dispatch to it.
Through periodic detection, a hardware processing unit that has recovered from an abnormality can return to normal task dispatch promptly, which improves the task processing efficiency and the overall performance of the graphics processor.
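The suspend and resume behaviour across detection periods can be simulated as follows. This is a sketch under stated assumptions: the detection outcomes are supplied as pre-recorded data rather than measured, and the function name and data layout are hypothetical.

```python
from typing import Dict, List


def run_periods(outcomes_by_period: List[Dict[str, bool]]) -> List[List[str]]:
    """For each detection period, return the units that receive tasks afterwards.

    outcomes_by_period[i][name] is the simulated detection outcome of a unit
    in period i: True if its result matched the correct result.
    """
    abnormal: Dict[str, bool] = {}
    schedule: List[List[str]] = []
    for outcomes in outcomes_by_period:
        for name, passed in outcomes.items():
            abnormal[name] = not passed   # suspend on failure, resume on recovery
        schedule.append([n for n, bad in abnormal.items() if not bad])
    return schedule


# Simulated data: unit_b fails in the second period and recovers in the third.
schedule = run_periods([
    {"unit_a": True, "unit_b": True},
    {"unit_a": True, "unit_b": False},
    {"unit_a": True, "unit_b": True},
])
print(schedule)
```

Because every period re-evaluates every unit, a recovered unit rejoins the dispatch set in the first period in which it passes, without a restart of the graphics processor.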
In this technical solution, the main control core is controlled to use test case data to periodically detect whether a hardware processing unit in the graphics processor is running abnormally. If the hardware processing unit is abnormal, the main control core is controlled to suspend task dispatch to it, so that the abnormality is discovered promptly and no further tasks are sent to the abnormal unit. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, task dispatch to it is resumed. This effectively avoids blocking caused by an abnormal hardware processing unit while the graphics processor is running, and promptly returns the unit to the working state once it recovers, thereby improving the task processing efficiency of the graphics processor.
The method for detecting an abnormal core in a graphics processor is used in a graphics processor that includes a timer, hardware processing units, a result comparator, a decision maker, a dispatcher, and other related hardware units.
In such a graphics processor, the method for detecting an abnormal core in the present application can be roughly divided into the following steps:
Step 1: Initialize the GPU, the timer, all hardware processing units, and the related hardware devices.
Specifically, in step 1, after initialization is complete, the GPU updates its state for the HOST, and the HOST accesses the GPU state through the PCI bus. A successfully initialized GPU updates the GPU state to ACTIVE (1) in the data structure describing the GPU state; if initialization fails, the state is UNACTIVE (0).
The data describing the GPU state is of type bool. A register associated with gpu_status (the gpu_status register) is defined in a register segment accessible to both the CPU and the GPU, and the GPU state is updated through this register. Both the CPU and the GPU have read and write access to the gpu_status register. After the CPU and the GPU have successfully established communication, the CPU writes UNACTIVE to the gpu_status register; when the GPU main control core is initialized successfully, the gpu_status register is updated to ACTIVE. The next step is entered once the CPU reads ACTIVE from the gpu_status register; otherwise the GPU main control core is judged to be abnormal.
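The handshake on the gpu_status register can be sketched in software. The model below is single-threaded and only approximates the real behaviour, where the CPU and the GPU run concurrently. The `SharedRegisters` class, the `init_ok` parameter, and the bounded poll count `max_polls` are assumptions; the description does not specify how long the CPU waits before judging the main control core abnormal.

```python
ACTIVE = True      # 1
UNACTIVE = False   # 0, spelled as in the description


class SharedRegisters:
    """Hypothetical stand-in for the register segment both sides can access."""

    def __init__(self) -> None:
        self.gpu_status = UNACTIVE


def gpu_initialize(regs: SharedRegisters, init_ok: bool) -> None:
    # GPU side: on successful main control core initialization, report ACTIVE.
    if init_ok:
        regs.gpu_status = ACTIVE


def handshake(init_ok: bool, max_polls: int = 3) -> bool:
    """Return True if the CPU may proceed to the next step."""
    regs = SharedRegisters()
    regs.gpu_status = UNACTIVE          # CPU writes UNACTIVE once communication is up
    gpu_initialize(regs, init_ok)       # in hardware this runs concurrently with polling
    for _ in range(max_polls):          # CPU polls the register (bound is an assumption)
        if regs.gpu_status == ACTIVE:
            return True
    return False                        # main control core judged abnormal


print(handshake(True), handshake(False))
```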
Step 2: and judging whether the dynamic detection time condition of the hardware processing unit is met, and executing rendering tasks or requesting task dispatch.
The step 2 specifically comprises the following steps: and judging whether the dynamic detection time condition of the hardware processing unit is met. Under the condition that the dynamic detection time condition is not met, the GPU master control core starts to dispatch the dispatcher, and sends tasks to all hardware processing units to manage task data backup and release. The dispatcher follows a strategy with low power consumption and idle priority, and performs task dispatching according to decision information of the decision maker. When the dynamic detection time condition is met, the GPU master control core acquires the address of the detection firmware, and the address is preferentially scheduled to the dispatcher, the dispatcher dispatches the request task to all hardware processing units, and meanwhile the dispatcher forms a request task index list to inform the decision maker that the request task is sent out. Since the previous task has not been processed, the requested task needs to be queued until the processing of the previous task is completed. After the task is completed, the dispatcher releases the task data backup.
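The interplay of the timer, the detection time condition, and priority scheduling can be sketched with a priority queue. This is one possible software reading of step 2, not the dispatcher of the embodiment. The `Dispatcher` class, the tick-based timer, and the two priority levels are assumptions introduced for illustration.

```python
import heapq
from typing import List, Tuple

DETECTION = 0   # lower number means higher priority
RENDER = 1


class Dispatcher:
    def __init__(self, period: int) -> None:
        self.period = period          # detection period in timer ticks (assumed unit)
        self.elapsed = 0
        self.queue: List[Tuple[int, int, str]] = []
        self._seq = 0                 # keeps FIFO order within one priority level

    def submit(self, priority: int, task: str) -> None:
        heapq.heappush(self.queue, (priority, self._seq, task))
        self._seq += 1

    def tick(self) -> None:
        """Advance the timer; enqueue a request task when the period elapses."""
        self.elapsed += 1
        if self.elapsed >= self.period:
            self.elapsed = 0
            self.submit(DETECTION, "request_task")

    def next_task(self) -> str:
        return heapq.heappop(self.queue)[2]


d = Dispatcher(period=2)
d.submit(RENDER, "draw_0")
d.submit(RENDER, "draw_1")
d.tick()
d.tick()                              # detection time condition is now met
order = [d.next_task() for _ in range(3)]
print(order)
```

The request task overtakes rendering tasks that are still waiting, but a task already taken from the queue is not interrupted, which matches the description that the request task waits for the previous task to complete.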
Step 3: the hardware processing unit processes the request task.
Step 3 specifically comprises the following: the hardware processing unit processes the request task. When the request task enters the hardware processing unit, the dispatcher sends the corresponding correct value to the result comparator. The hardware processing unit fetches the instruction address according to the request task, decodes the instruction, executes it, and outputs the result to the result comparator once execution is complete. The request task may be either a simple or a complex operation. The correct value for each request task is stored at a specific address or bound to the request task itself. When a request task is to be dispatched, the dispatcher looks up the location corresponding to that request task and selects the matching correct value. The correct value may be a random number or a preselected value that rarely occurs, so as to avoid false positives. For example, "1+2" is dispatched as a request task to hardware processing unit A; after hardware processing unit A processes the request task, the actual value obtained is passed to the result comparator of hardware processing unit A as the first input. Because "1+2" is the request task, the dispatcher looks up the preset correct value "3" and sends it to the result comparator of hardware processing unit A as the second input.
Step 4: the result comparator compares the actual value with the correct value and outputs a comparison result.
Step 4 specifically comprises the following: the result comparator compares the actual value with the correct value and outputs a comparison result.
The comparison result is passed to the decision maker as input. The comparison result is either "0" or "1": "0" indicates that the actual value and the correct value differ, and "1" indicates that they are the same. For example, if hardware processing unit A computes "1+2" and outputs the actual value "3", the result comparator finds it equal to the correct value "3", and the comparison result is "1". If the actual value output is not "3", the comparison with the correct value "3" yields "0". A hardware processing unit may also hang because of high temperature or a software problem and be unable to process the request task; in that case the comparison result is likewise "0".
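The "1+2" example from steps 3 and 4 can be written out as a short sketch. The `RequestTask` type and the three simulated units are hypothetical; modeling a hung unit as one that returns no value (`None`) is an assumption, chosen so that it produces the comparison result "0" as the text describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestTask:
    operands: tuple
    correct_value: int  # correct value bound to the task, one option in the text


def compare(actual: Optional[int], correct: int) -> int:
    """Result comparator: 1 if the two inputs are equal, 0 otherwise.

    `None` models a unit that hung and never delivered a result.
    """
    if actual is None:
        return 0
    return 1 if actual == correct else 0


def healthy_unit(task):
    return sum(task.operands)


def faulty_unit(task):
    # Simulated computation error: the result is off by one.
    return sum(task.operands) + 1


def stuck_unit(task):
    # Simulated hang: no result is ever produced.
    return None


task = RequestTask(operands=(1, 2), correct_value=3)
units = {"A": healthy_unit, "B": faulty_unit, "C": stuck_unit}
results = {name: compare(unit(task), task.correct_value) for name, unit in units.items()}
```

Here `results` maps unit A to 1 and units B and C to 0, matching the three outcomes discussed above.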
Step 5: the decision maker receives the comparison results from the different hardware processing units and fills them into the request task index list provided by the dispatcher.
Step 5 specifically comprises the following: the decision maker collects the comparison results from the different hardware processing units and fills each result into the corresponding entry of the request task index list that the dispatcher provided. The request task index list reflects the status of all hardware processing units and can be accessed by both the dispatcher and the decision maker. The decision maker judges the state of each hardware processing unit from the comparison result in the request task index list. A comparison result of "1" indicates that the hardware processing unit is normal; a comparison result of "0" indicates that it is abnormal, in which case the decision maker marks the abnormal hardware processing unit and informs the dispatcher of its current state.
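A minimal sketch of the decision maker's bookkeeping follows. Representing the request task index list as a dictionary and treating an entry that was never filled in as abnormal are assumptions for illustration; the application only states that "1" means normal and "0" means abnormal.

```python
def fill_index_list(index_list, comparison_results):
    """Write each unit's comparison result into its entry of the index list."""
    for unit, result in comparison_results.items():
        if unit in index_list:
            index_list[unit] = result
    return index_list


def unit_states(index_list):
    """Convert comparison results into unit states.

    Assumption: an entry still set to None (no result received) counts as abnormal.
    """
    return {u: "normal" if r == 1 else "abnormal" for u, r in index_list.items()}


# The dispatcher created one pending entry per unit when it sent the request task.
index_list = {"A": None, "B": None, "C": None}
fill_index_list(index_list, {"A": 1, "B": 0})
states = unit_states(index_list)
abnormal_units = sorted(u for u, s in states.items() if s == "abnormal")
```

In this run, unit B is marked abnormal because its result was "0", and unit C is marked abnormal because it never reported; `abnormal_units` is what the decision maker would pass on to the dispatcher.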
Step 6: the dispatcher dispatches tasks according to the states of the hardware processing units.
Step 6 specifically comprises the following: the dispatcher dispatches tasks according to the states of the hardware processing units. The dispatcher continues to dispatch tasks to the normal hardware processing units. Depending on specific requirements, the task backups of an abnormal hardware processing unit are either redistributed to other hardware processing units or discarded, after which the abnormal hardware processing unit is reset. The dispatcher disables each hardware processing unit that the decision maker has marked as abnormal and dispatches no tasks to it. The flow then returns to step 2 to wait for the next detection request task; based on the comparison result of that task, it is decided whether the unit resumes work, and a recovered unit rejoins the work queue and receives tasks again. In more detail, the decision maker collects the comparison results from the different hardware processing units, uses the information transmitted by each hardware processing unit to find the matching entry in the request task index list, converts each comparison result into a hardware processing unit state, and writes that state into the request task index list, so that the list records the latest state of every hardware processing unit. After the update, the decision maker notifies the dispatcher to read the new request task index list and finally resets the hardware processing units in an abnormal state. On receiving the update signal from the decision maker, the dispatcher reads the new request task index list, obtains the latest hardware processing unit states from it, and builds a new hardware processing unit dispatch catalog inside the dispatcher.
The dispatcher then dispatches tasks selectively according to the rendering requirements: it continues to dispatch tasks to the normal hardware processing units, transfers tasks away from the abnormal hardware processing units, and temporarily prohibits the use of those units.
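The disable, redistribute, and resume behavior of step 6 can be sketched with a small dispatch catalog. The `DispatchCatalog` class, the shortest-backlog selection rule, and the choice to redistribute (rather than discard) the task backups are illustrative assumptions; the reset of the abnormal unit itself is not modeled.

```python
class DispatchCatalog:
    """Hypothetical dispatch catalog kept inside the dispatcher."""

    def __init__(self, units):
        self.enabled = {u: True for u in units}
        self.backups = {u: [] for u in units}  # task backups held per unit

    def apply_states(self, states):
        """Disable abnormal units, re-enable normal ones, return orphaned tasks."""
        orphaned = []
        for unit, state in states.items():
            if state == "abnormal":
                self.enabled[unit] = False
                orphaned.extend(self.backups[unit])
                self.backups[unit] = []
            else:
                self.enabled[unit] = True
        return orphaned  # the caller redistributes or discards these

    def dispatch(self, task):
        candidates = [u for u, on in self.enabled.items() if on]
        if not candidates:
            raise RuntimeError("no normal hardware processing unit available")
        unit = min(candidates, key=lambda u: (len(self.backups[u]), u))
        self.backups[unit].append(task)
        return unit


catalog = DispatchCatalog(["A", "B"])
catalog.dispatch("t1")  # goes to A
catalog.dispatch("t2")  # goes to B

# Detection round 1: B fails, so its backup is moved to a normal unit.
orphaned = catalog.apply_states({"A": "normal", "B": "abnormal"})
moved_to = [catalog.dispatch(t) for t in orphaned]
unit_for_t3 = catalog.dispatch("t3")

# Detection round 2: B passes again and rejoins the work queue.
catalog.apply_states({"A": "normal", "B": "normal"})
unit_for_t4 = catalog.dispatch("t4")
```

While B is disabled, every task lands on A; once the next detection round reports B as normal, B receives work again without any change to the tasks already running on A.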
In the following, the detection firmware refers to the firmware used to test the respective hardware processing units; it is compiled on the CPU side. Before the GPU can use the firmware, the firmware must be loaded from the DDR on the HOST side into the GDDR of the GPU. The methods for loading the detection firmware in the present application include, but are not limited to, the following three: PCI channel loading, JTAG channel loading, and Flash power-on loading.
PCI channel loading: the PCI device and the HOST exchange data over the PCIe protocol, and the HOST accesses the GDDR of the GPU through address mapping. The target address on the GPU side is mapped to the CPU side through the PCI space, so the driver can operate on the GPU-side target address. Once the location and size of the firmware file have been specified, the firmware is written to that address, either by ordinary writes or by DMA burst.
JTAG channel loading: after the JTAG driver is installed, the HOST can monitor and operate on the global GDDR address space over a USB-to-JTAG connection. To write over JTAG, a JTAG command specifying the firmware path, the firmware size, and the GDDR target address is executed on the command line; once the command completes, the firmware has been loaded from the HOST side through the JTAG channel into the specified GDDR address space.
Flash power-on loading: the firmware is programmed into the Flash of the GPU with a programmer through the JTAG emulator. Once the firmware has been programmed into the Flash, the stored data is retained even when the Flash is powered off. When the system is powered on again, the GPU can fetch instructions and data directly from the Flash.
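As a rough illustration of the PCI channel case, the following sketch copies a firmware image into a mapped window in fixed-size bursts. It is a pure simulation: the `bytearray` stands in for the GDDR region mapped into the HOST address space, and `load_firmware`, the burst size, and the target address are hypothetical. A real driver would use the operating system's PCI and DMA interfaces, which are not shown in the application.

```python
def load_firmware(gddr, target_addr, image, burst_bytes=64):
    """Copy `image` into `gddr` starting at `target_addr`, one burst at a time.

    Returns the number of bytes written. Raises ValueError if the image would
    not fit inside the mapped window.
    """
    size = len(image)
    if target_addr < 0 or target_addr + size > len(gddr):
        raise ValueError("firmware does not fit in the mapped window")
    for offset in range(0, size, burst_bytes):
        chunk = image[offset:offset + burst_bytes]
        start = target_addr + offset
        gddr[start:start + len(chunk)] = chunk
    return size


gddr = bytearray(1024)     # simulated GDDR window, initially zeroed
image = bytes(range(200))  # simulated firmware image
written = load_firmware(gddr, target_addr=256, image=image)
```

The firmware size and target address are the same two pieces of information that the text says must be specified before the write begins; the bounds check simply makes that precondition explicit.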
Corresponding to the foregoing method embodiments, the present application also provides a device for detecting abnormal hardware in a graphics processor, an electronic device, and corresponding embodiments.
FIG. 3 is a schematic diagram of a device for detecting abnormal hardware in a graphics processor according to an embodiment of the present application.
As shown in fig. 3, the apparatus 30 for detecting abnormal hardware in a graphics processor according to the embodiment of the present application includes:
an initialization module 301, an anomaly detection module 302, and a task dispatch module 303;
the initialization module 301 is configured to: initialize the graphics processor when the graphics processor is powered on;
the anomaly detection module 302 is configured to: if the initialization is normal, control the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit;
the task dispatch module 303 is configured to: if the hardware processing unit is running abnormally, control the main control core to suspend task dispatch to the hardware processing unit;
the task dispatch module 303 is further configured to: when the hardware processing unit is detected to have returned to normal operation in any subsequent period, control the main control core to resume task dispatch to the hardware processing unit.
Optionally, in some embodiments, an abnormality flag is set for the hardware processing unit; the anomaly detection module 302 is further configured to: when the hardware processing unit is running abnormally, control the main control core to set the abnormality flag of the hardware processing unit to abnormal; when the hardware processing unit returns to normal operation, control the main control core to set the abnormality flag of the hardware processing unit to normal.
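The interplay between periodic detection and the abnormality flag can be sketched as a loop over detection periods. The `run_periods` function and its input format (one dictionary of pass/fail outcomes per period) are assumptions made for illustration only.

```python
def run_periods(probe_results):
    """Update abnormality flags once per detection period.

    `probe_results` is a list with one dict per period, mapping a unit id to
    True (test case passed) or False (test case failed or timed out).
    Returns the final flags and, per period, the units eligible for dispatch.
    """
    flags = {}
    eligible_history = []
    for period in probe_results:
        for unit, passed in period.items():
            flags[unit] = "normal" if passed else "abnormal"
        eligible_history.append(sorted(u for u, f in flags.items() if f == "normal"))
    return flags, eligible_history


periods = [
    {"A": True, "B": True},   # both units pass
    {"A": True, "B": False},  # B fails: dispatch to B is suspended
    {"A": True, "B": True},   # B passes again: dispatch to B resumes
]
flags, eligible_history = run_periods(periods)
```

Because the flag is rewritten every period, a unit that recovers is picked up automatically in the next round, which is the resume behavior that the task dispatch module relies on.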
Optionally, in some embodiments, the test case data includes integral test case data and individual test case data. The integral test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units. The individual test case data is designed for a particular type of hardware processing unit and can test only that type; it includes test case class C data, test case class B data, and test case class A data.
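The distinction between integral and individual test case data amounts to a filter on the unit type. In the sketch below, the `CaseData` type, the type labels "A", "B", and "C", and the assumption that test case class A data targets unit type A (and so on) are hypothetical; the application does not state how classes map to unit types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseData:
    name: str
    unit_type: Optional[str] = None  # None marks integral data, valid for any type


def applicable_cases(cases, unit_type):
    """Return the names of the test cases that may run on a unit of `unit_type`."""
    return [c.name for c in cases if c.unit_type is None or c.unit_type == unit_type]


cases = [
    CaseData("integral"),
    CaseData("class_a", "A"),
    CaseData("class_b", "B"),
    CaseData("class_c", "C"),
]
cases_for_b = applicable_cases(cases, "B")
```

A unit of type B is offered the integral data plus its own class of individual data, and never the data designed for another type.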
Optionally, in some embodiments, the graphics processor further includes: a dispatcher, a decision maker and a result comparator; the anomaly detection module 302 is specifically configured to perform the following operations:
controlling the main control core to schedule the test case data to the dispatcher with priority;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send a request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
Further optionally, the anomaly detection module 302 controls the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result as follows:
if the comparison result is equal, the anomaly detection module 302 determines that the hardware processing unit is running normally;
if the comparison result is unequal, the anomaly detection module 302 determines that the hardware processing unit is running abnormally.
Abnormal operation of the hardware processing unit covers the following two cases: 1. the hardware processing unit executes the task, but the task processing result obtained after processing is incorrect; 2. the task times out because the hardware processing unit does not process it, so the comparison result is unequal.
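The two abnormal cases can be told apart with a timeout around the task, as in the sketch below. Running the task in a worker thread and the specific timeout values are assumptions chosen to keep the example runnable on a host machine; they say nothing about how the hardware measures a timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def classify(task, correct, timeout_s):
    """Run `task` and classify the unit as normal or as one of the two abnormal cases."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        actual = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Case 2: the task was not processed in time, so no equal result exists.
        return "abnormal: timeout"
    finally:
        pool.shutdown(wait=False)
    # Case 1 applies when a result arrives but differs from the correct value.
    return "normal" if actual == correct else "abnormal: wrong result"


def slow_task():
    time.sleep(0.2)  # simulated hang that outlasts the timeout
    return 3


normal = classify(lambda: 1 + 2, correct=3, timeout_s=0.5)
wrong = classify(lambda: 4, correct=3, timeout_s=0.5)
timed_out = classify(slow_task, correct=3, timeout_s=0.02)
```

Both abnormal cases end in the same verdict for dispatch purposes, but keeping them distinct in logs can help when deciding whether a reset is likely to fix the unit.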
Optionally, in some embodiments, before controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, the anomaly detection module 302 is further configured to obtain the test case data in one of the following ways:
controlling the main control core to send a data request signal to the host processor (CPU) to obtain the test case data, wherein the test case data is stored on the CPU side and can be called dynamically;
or controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware and the detection firmware is loaded into the memory of the graphics processor.
Optionally, in some embodiments, when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading, and Flash power-on loading.
With regard to the device in the above embodiment, the specific manner in which each module performs its operations, and the corresponding beneficial effects, have already been described in detail in the method embodiments and are not explained again here.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 4, the electronic device 40 in the embodiment of the present application includes a memory 401 and a processor 402. The memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the method of any of the embodiments described above.
The processor 402 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 401 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 402 or other modules of the computer. The persistent storage may be a readable and writable, non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a readable and writable storage device, or a volatile readable and writable storage device such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Furthermore, memory 401 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, memory 401 may include readable and/or writable removable storage devices such as compact discs (CDs), read-only digital versatile discs (e.g., DVD-ROMs, dual-layer DVD-ROMs), read-only Blu-ray discs, ultra-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards), magnetic floppy disks, and the like. The computer-readable storage medium does not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 401 has stored thereon executable code which, when processed by the processor 402, may cause the processor 402 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or a computing device, a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation should not be considered to be beyond the scope of this application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "include", "comprise", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements is not limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and it is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. A method for detecting abnormal hardware in a graphics processor, characterized in that the graphics processor comprises a main control core and a hardware processing unit; the detection method comprises the following steps:
initializing the graphics processor when the graphics processor is powered on;
if the initialization is normal, controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit, and the test case data comprises integral test case data and individual test case data; the integral test case data does not distinguish between hardware processing unit types and is used to test all types of hardware processing units; the individual test case data is designed for a particular type of hardware processing unit and can test only that type of hardware processing unit;
if the hardware processing unit is running abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit;
and when it is detected in any subsequent period that the hardware processing unit has returned to normal operation, controlling the main control core to resume task dispatch to the hardware processing unit.
2. The detection method according to claim 1, wherein an abnormality flag is set for the hardware processing unit; the detection method further comprises:
when the hardware processing unit is running abnormally, controlling the main control core to set the abnormality flag of the hardware processing unit to abnormal;
and when the hardware processing unit returns to normal operation, controlling the main control core to set the abnormality flag of the hardware processing unit to normal.
3. The method according to claim 1, wherein the graphics processor further comprises a dispatcher, a decision maker, and a result comparator; and controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally comprises:
controlling the main control core to schedule the test case data to the dispatcher with priority;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send a request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
4. The method according to claim 3, wherein controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result comprises:
if the comparison result is equal, determining that the hardware processing unit is running normally;
and if the comparison result is unequal, determining that the hardware processing unit is running abnormally.
5. The method according to claim 3, wherein before controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, the method further comprises:
controlling the main control core to send a data request signal to the host processor (CPU) to obtain the test case data, wherein the test case data is stored on the CPU side and can be called dynamically;
or controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware and the detection firmware is loaded into the memory of the graphics processor.
6. The method according to claim 5, wherein when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor comprise: PCI channel loading, JTAG channel loading, and Flash power-on loading.
7. A device for detecting abnormal hardware in a graphics processor, characterized in that the graphics processor comprises a main control core and a hardware processing unit; the detection device comprises:
an initialization module, an abnormality detection module, and a task dispatch module;
the initialization module is configured to: initialize the graphics processor when the graphics processor is powered on;
the abnormality detection module is configured to: if the initialization is normal, control the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is running abnormally, wherein the test case data is used to detect the running state of the hardware processing unit, and the test case data comprises integral test case data and individual test case data; the integral test case data does not distinguish between hardware processing unit types and is used to test all types of hardware processing units; the individual test case data is designed for a particular type of hardware processing unit and can test only that type of hardware processing unit;
the task dispatch module is configured to: if the hardware processing unit is running abnormally, control the main control core to suspend task dispatch to the hardware processing unit;
the task dispatch module is further configured to: when it is detected in any subsequent period that the hardware processing unit has returned to normal operation, control the main control core to resume task dispatch to the hardware processing unit.
8. An electronic device, comprising:
a memory and a processor, wherein the memory has executable code stored thereon;
the executable code, when invoked by the processor, causes the electronic device to perform the steps in the method of detecting abnormal hardware in a graphics processor as claimed in any one of claims 1-6.
9. A computer readable storage medium having stored thereon executable code which when invoked by a processor of an electronic device causes the electronic device to perform the steps of the method of detecting abnormal hardware in a graphics processor as claimed in any one of claims 1-6.
CN202310030933.4A 2023-01-10 2023-01-10 Method and device for detecting abnormal core in graphics processor and electronic equipment Active CN115827355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310030933.4A CN115827355B (en) 2023-01-10 2023-01-10 Method and device for detecting abnormal core in graphics processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310030933.4A CN115827355B (en) 2023-01-10 2023-01-10 Method and device for detecting abnormal core in graphics processor and electronic equipment

Publications (2)

Publication Number Publication Date
CN115827355A CN115827355A (en) 2023-03-21
CN115827355B true CN115827355B (en) 2023-04-28

Family

ID=85520523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310030933.4A Active CN115827355B (en) 2023-01-10 2023-01-10 Method and device for detecting abnormal core in graphics processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN115827355B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820837A (en) * 2023-06-28 2023-09-29 合芯科技有限公司 Exception handling method and device for system component

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815043A (en) * 2019-01-25 2019-05-28 华为技术有限公司 Fault handling method, relevant device and computer storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236515B (en) * 2007-01-31 2010-05-19 迈普通信技术股份有限公司 Multi-core system single-core abnormity restoration method
US8707314B2 (en) * 2011-12-16 2014-04-22 Advanced Micro Devices, Inc. Scheduling compute kernel workgroups to heterogeneous processors based on historical processor execution times and utilizations
US9513687B2 (en) * 2013-08-28 2016-12-06 Via Technologies, Inc. Core synchronization mechanism in a multi-die multi-core microprocessor
US10019576B1 (en) * 2015-04-06 2018-07-10 Intelligent Automation, Inc. Security control system for protection of multi-core processors
KR101997254B1 (en) * 2017-05-10 2019-07-08 김덕우 Computer having isolated user computing part
US20210294707A1 (en) * 2020-03-20 2021-09-23 Nvidia Corporation Techniques for memory error isolation
US20210165730A1 (en) * 2021-02-12 2021-06-03 Intel Corporation Hardware reliability diagnostics and failure detection via parallel software computation and compare

Also Published As

Publication number Publication date
CN115827355A (en) 2023-03-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant