CN115827355A - Detection method and detection device for abnormal core in graphic processor and electronic equipment - Google Patents

Publication number
CN115827355A
Authority
CN
China
Prior art keywords
processing unit
hardware processing
test case
controlling
main control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310030933.4A
Other languages
Chinese (zh)
Other versions
CN115827355B (en)
Inventor
江靖华
梁存旭
张坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Original Assignee
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority to CN202310030933.4A
Publication of CN115827355A
Application granted
Publication of CN115827355B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application disclose a method and a device for detecting abnormal hardware in a graphics processor, and an electronic device. They can effectively avoid the blockage caused by an abnormal hardware processing unit while the graphics processor is running, and promptly return the hardware processing unit to service once it recovers, thereby improving the task processing efficiency of the graphics processor. The detection method comprises the following steps: when the graphics processor is powered on, initializing the graphics processor; if the initialization is normal, controlling the main control core to use test case data to periodically detect whether a hardware processing unit in the graphics processor is operating abnormally, wherein the test case data is used for detecting the operating state of the hardware processing unit; if the hardware processing unit is operating abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit; and when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume dispatching tasks to the hardware processing unit.

Description

Detection method and detection device for abnormal core in graphic processor and electronic equipment
Technical Field
The present application relates to the field of microelectronic technologies, and in particular, to a method and an apparatus for detecting abnormal hardware in a graphics processor, and an electronic device.
Background
In the field of graphics processing, the Graphics Processing Unit (GPU) is responsible for processing huge amounts of data, and its data processing capability keeps increasing. Before delivery, a GPU passes through a series of test stages at the test plant, such as Automatic Test Equipment (ATE) testing, functional testing, and reliability testing.
In the current test stages, most defect inspection steps screen the GPU mainly by see-through (perspective) and appearance inspection, and do not test the functional characteristics of the GPU. As a result, a GPU that has passed testing may still, because of its operating environment or software problems, develop an abnormality in a hardware processing unit at run time: the current task cannot be completed, and the hardware processing unit (that is, the abnormal core) is blocked.
Disclosure of Invention
Therefore, to solve the above problems, the present application provides a method and an apparatus for detecting an abnormal core in a graphics processor, and an electronic device. They can effectively prevent a hardware processing unit from blocking because of an abnormality while the graphics processor is running, and promptly return the hardware processing unit to service once it recovers, thereby improving the task processing efficiency of the graphics processor. Note that a hardware processing unit in which an abnormality occurs is what this application calls an abnormal core.
In a first aspect, the present application provides a method for detecting an abnormal core in a graphics processor, where the graphics processor includes a main control core and a hardware processing unit, and the method includes:
when the graphics processor is powered on, initializing the graphics processor;
if the initialization is normal, controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is operating abnormally, wherein the test case data is used for detecting the operating state of the hardware processing unit;
if the hardware processing unit is operating abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit;
and when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume dispatching tasks to the hardware processing unit.
Optionally, in some implementation manners of the first aspect, an exception flag is set for the hardware processing unit; the detection method further comprises the following steps:
when the hardware processing unit operates abnormally, controlling the main control core to set the exception flag of the hardware processing unit to abnormal;
and when the hardware processing unit returns to normal operation, controlling the main control core to set the exception flag of the hardware processing unit to normal.
Optionally, in some implementation manners of the first aspect, the test case data includes overall test case data and individual test case data; the overall test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units; the individual test case data is designed for one particular type of hardware processing unit and can test only that type.
Optionally, in some implementations of the first aspect, the graphics processor further includes: a dispatcher, a decision maker and a result comparator; controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is operating abnormally comprises:
controlling the main control core to preferentially schedule the test case data to the dispatcher;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send the request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
Optionally, in some implementations of the first aspect, controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result includes:
if the comparison results are equal, determining that the hardware processing unit operates normally;
and if the comparison results are not equal, determining that the hardware processing unit is abnormal in operation.
Optionally, in some implementation manners of the first aspect, before controlling the main control core to use the test case data and periodically detect whether the hardware processing unit in the graphics processor runs abnormally, the method further includes:
controlling the main control core to send a data request signal to the CPU of the host to acquire the test case data, wherein the test case data is stored on the CPU side and can be dynamically called;
or controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware, and the detection firmware is loaded into the memory of the graphics processor.
Optionally, in some implementation manners of the first aspect, when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading, and Flash power-on loading.
In a second aspect, the present application provides an apparatus for detecting abnormal hardware in a graphics processor, where the graphics processor includes a main control core and a hardware processing unit; the detection device comprises:
an initialization module, an abnormality detection module, and a task dispatching module;
the initialization module is configured to: when the graphics processor is powered on, initialize the graphics processor;
the abnormality detection module is configured to: if the initialization is normal, control the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor is operating abnormally, wherein the test case data is used for detecting the operating state of the hardware processing unit;
the task dispatching module is configured to: if the hardware processing unit is operating abnormally, control the main control core to suspend task dispatch to the hardware processing unit;
the task dispatching module is further configured to: when the hardware processing unit is detected to have returned to normal operation in any subsequent period, control the main control core to resume dispatching tasks to the hardware processing unit.
In a third aspect, the present application provides an electronic device, comprising:
a memory and a processor, wherein the memory has executable code stored thereon;
when the executable code is called by the processor, the electronic device is caused to perform the steps in the method for detecting abnormal hardware in a graphics processor as described in any one of the first aspect and its implementation.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon executable code, which, when called by a processor of an electronic device, causes the electronic device to perform the steps in the method for detecting abnormal hardware in a graphics processor according to any one of the first aspect and the implementation manner thereof.
The technical scheme provided by the application has the following beneficial effects:
In the technical scheme of the application, the main control core is controlled to use the test case data to periodically detect whether a hardware processing unit in the graphics processor is operating abnormally. If it is, the main control core is controlled to suspend task dispatch to that hardware processing unit, so the abnormality is found in time and no further tasks are sent to the abnormal unit. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, task dispatch to it is resumed. This effectively avoids the blockage caused by an abnormal hardware processing unit while the graphics processor is running, returns the hardware processing unit to service promptly once it recovers, and improves the task processing efficiency of the graphics processor.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for detecting abnormal hardware in a graphics processor according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an apparatus for detecting abnormal hardware in a graphics processor according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In order to facilitate understanding of the technical solution in the embodiment of the present application, an application scenario in the embodiment of the present application is described first, specifically as follows:
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application.
As shown in fig. 1, the application scenario includes a host and a PCI device. The host comprises a CPU (central processing unit) and DDR (double data rate) memory, with a data transmission channel established between the CPU and the DDR. The PCI device comprises a graphics processor (GPU) and graphics memory (GDDR), with a data transmission channel established between the GPU and the GDDR. The GPU and the CPU are connected through a PCI bus.
In the application scenario described above, after the GPU and CPU are powered on, the BIOS of the HOST scans for PCI devices and matches them with the appropriate drivers. After matching succeeds, the HOST and the PCI device can communicate over the PCI protocol. The GPU provides hardware information such as device resources and states to the HOST; based on this hardware information, the HOST CPU operates and adjusts the working state of the GPU device, and loads the necessary firmware and hardware acceleration tasks onto the GPU.
When one or more hardware processing units in the GPU become stuck in an inoperable state, the current task cannot be completed and the affected units cannot take part in the next task, so the overall computing capability of the GPU drops; restarting the GPU to recover is undesirable. A mechanism for dynamically detecting whether the hardware processing units are working normally is therefore necessary to keep the GPU working without a restart.
In view of the above technical problems, an embodiment of the present application provides a method for detecting abnormal hardware in a graphics processor.
FIG. 2 is a flowchart illustrating a method for detecting abnormal hardware in a graphics processor according to an embodiment of the present disclosure.
As shown in fig. 2, the method for detecting abnormal hardware in a graphics processor in the embodiment of the present application includes:
201. When the graphics processor is powered on, the graphics processor is initialized.
In the embodiment of the application, when the graphics processor is powered on, the GPU, all hardware processing units, and the related hardware devices are initialized. The initialized GPU reports its state to the HOST, and the HOST accesses the GPU state through the PCI bus. A GPU that initializes successfully sets the GPU state to ACTIVE (1) in the data structure describing the GPU state; a GPU that fails to initialize remains INACTIVE (0).
Optionally, the graphics processor further includes a timer, which sets the cycle duration for the periodic detection of the hardware processing units in the graphics processor. The timer is initialized when the graphics processor is powered on; when the periodic detection of step 202 starts, the timer is reset to zero and starts counting from the beginning, and it is used specifically for timing during the periodic detection.
202. If the initialization is normal, controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor is operating abnormally, wherein the test case data is used for detecting the operating state of the hardware processing unit.
In the embodiment of the present application, initializing the graphics processor mainly means initializing the main control core in the graphics processor. If the initialization is normal, the main control core is normal; if the initialization is abnormal, the main control core has a problem, that is, the main control core is detected to be abnormal.
The test case data is functional data used to detect the operating state of the hardware processing unit. The test case data includes overall test case data and individual test case data. The overall test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units; the individual test case data is designed for one particular type of hardware processing unit and can test only that type. For example, the individual test case data includes type A test case data, type B test case data, and type C test case data.
Further, the test cases may exist in forms including, but not limited to: 1. a static form, packaged in the detection firmware; 2. a dynamic form, stored on the HOST side and called at any time. Test cases in the static form can be run as a quick test at start-up or as a timed test. Test cases in the dynamic form are called at variable times: when a hardware processing unit needs to be detected and the GPU meets the conditions, the dynamic test cases on the CPU side can be called and executed.
Test cases in the dynamic form come in many varieties, such as cases for testing vertex shaders, cases for testing tessellation shaders, cases for testing geometry shaders, and so forth. Each can test a particular kind of hardware processing unit; they are stored together on the HOST side, and the relevant test cases are called dynamically according to the test requirements. If, during detection, the test cases stored on the HOST side are found to be insufficient to test a hardware processing unit, a case to distribute is dynamically selected, or a case is dynamically compiled and generated, according to the feedback.
The test cases in the dynamic form include overall tests and individual tests. The overall test case is the same as the test case packaged in the firmware and can test all hardware processing units. The individual test cases are developed for each type of hardware processing unit and can be called dynamically. The dynamic test cases support, but are not limited to, hierarchical testing. The first layer is the overall test, which detects abnormalities uniformly. Subsequent testing is then performed according to the unit type of the abnormal hardware processing units and only on the marked hardware processing units, which avoids the conflict of retesting other, normal hardware processing units. Testing may also be performed for a particular hardware processing unit type.
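The hierarchical test described above can be illustrated with a minimal Python sketch. Everything named here (`Unit`, `run_case`, `hierarchical_test`, the `"overall"` and `"individual:<type>"` case names, and the `healthy` field that stands in for real hardware behaviour) is a hypothetical illustration and not part of the application.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    kind: str      # hypothetical type label, e.g. "vertex" or "geometry"
    healthy: bool  # stand-in for real hardware behaviour


def run_case(case_name: str, unit: Unit) -> bool:
    """Stand-in for running a test case; True means the result matched."""
    return unit.healthy


def hierarchical_test(units):
    # Layer 1: one overall case, run on every unit regardless of type.
    marked = [u for u in units if not run_case("overall", u)]
    # Layer 2: an individual case chosen by unit type, run only on the
    # marked units so that healthy units are not retested.
    retested = {u.name: (u.kind, run_case("individual:" + u.kind, u))
                for u in marked}
    return marked, retested


units = [Unit("vs0", "vertex", True),
         Unit("vs1", "vertex", False),
         Unit("gs0", "geometry", True)]
marked, retested = hierarchical_test(units)
```

In this sketch only `vs1` reaches the second layer; the two healthy units are tested once and then left alone.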
Optionally, for the dynamic form, in which the test case data is stored on the HOST side and called at any time, the test case data is obtained as follows: controlling the main control core to send a data request signal to the CPU of the host to acquire the test case data, wherein the test case data is stored on the CPU side and can be dynamically called.
Optionally, for the static form, in which the test case data is packaged in the detection firmware, the test case data is obtained as follows: controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packaged in the detection firmware, and the detection firmware is loaded into the memory of the graphics processor.
Still further optionally, when the test case data is packaged in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading, and Flash power-on loading. The three ways of loading the detection firmware will be described in detail below.
In the embodiments of the present application, the hardware processing units include, but are not limited to, vertex shaders, tessellation shaders, geometry shaders, fragment shaders, and the like.
In the embodiment of the present application, the main control core is controlled to periodically detect whether the hardware processing unit in the graphics processor runs abnormally by using the test case data, and specifically may perform the following operations:
controlling the main control core to preferentially schedule the test case data to the dispatcher;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, to send the request task index table corresponding to the request task to the decision maker, and to send the correct task result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and to send the resulting task processing result to the result comparator;
controlling the result comparator to compare the correct task result with the task processing result to obtain a comparison result, and to send the comparison result to the decision maker;
and controlling the decision maker to determine, according to the comparison result, whether the hardware processing unit is abnormal, and to update the request task index table.
It should be noted that the request task index table reflects the states of all hardware processing units, and both the dispatcher and the decision maker can access it. The request task index table serves as the task dispatch basis of the main control core: the main control core dispatches tasks according to the table, which effectively prevents tasks from being dispatched to abnormal hardware processing units.
Further optionally, controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result includes:
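As a rough illustration of how the request task index table can act as the dispatch basis, the following Python sketch filters dispatch targets by the state recorded in the table. The table layout (a dict from unit id to state), the unit ids, and the round-robin assignment are all assumptions made for the example; the application does not specify them.

```python
NORMAL, ABNORMAL = 0, 1  # assumed encoding of the per-unit state


def dispatch(tasks, index_table):
    """Assign tasks round-robin to the units the table lists as normal."""
    targets = [uid for uid, state in index_table.items() if state == NORMAL]
    assignment = {uid: [] for uid in targets}
    if not targets:
        return assignment  # nothing can be dispatched right now
    for i, task in enumerate(tasks):
        assignment[targets[i % len(targets)]].append(task)
    return assignment


# Hypothetical table as it might look after the decision maker's update.
table = {"unit_a": NORMAL, "unit_b": ABNORMAL, "unit_c": NORMAL}
plan = dispatch(["t0", "t1", "t2"], table)
```

Because `unit_b` is recorded as abnormal, it never appears in `plan`, so no task can queue up behind it.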
if the comparison results are equal, determining that the hardware processing unit operates normally;
and if the comparison results are not equal, determining that the hardware processing unit is abnormal in operation.
Abnormal operation of the hardware processing unit covers the following two cases: 1. the hardware processing unit executes and processes the task, but the task processing result obtained is incorrect; 2. the task times out without the hardware processing unit having processed it, in which case the comparison result is also "not equal".
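The two abnormal cases can be folded into a single comparison, as in the minimal Python sketch below. Representing a timed-out task by `None` is an assumption of this example, as are the function names `comparator` and `decide`.

```python
from typing import Optional


def comparator(expected: int, actual: Optional[int]) -> bool:
    """Return True for 'equal'. actual is None when the task timed out."""
    return actual is not None and actual == expected


def decide(comparison_equal: bool) -> str:
    """Decision maker: equal means normal, not equal means abnormal."""
    return "normal" if comparison_equal else "abnormal"


verdict_ok = decide(comparator(3, 3))          # correct result
verdict_wrong = decide(comparator(3, 4))       # case 1: wrong result
verdict_timeout = decide(comparator(3, None))  # case 2: timeout
```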
203. If the hardware processing unit is operating abnormally, controlling the main control core to suspend task dispatch to the hardware processing unit.
In the embodiment of the application, stopping task dispatch to the abnormal hardware processing unit avoids the task backlog, and the resulting drop in the overall performance of the graphics processor, that would occur because tasks cannot be executed in the abnormal hardware processing unit.
Optionally, in the detection method according to the embodiment of the present application, a corresponding exception flag is set in advance for each hardware processing unit. For example, an exception flag of 1 indicates that the hardware processing unit is abnormal, and an exception flag of 0 indicates that it is normal. The detection method further includes:
when the hardware processing unit operates abnormally, controlling the main control core to set the exception flag of the hardware processing unit to abnormal; for example, the exception flag of the hardware processing unit is set to 1;
when the hardware processing unit returns to normal operation, controlling the main control core to set the exception flag of the hardware processing unit to normal; for example, the exception flag of the hardware processing unit is set to 0.
204. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume dispatching tasks to the hardware processing unit.
In the embodiment of the application, when the abnormal hardware processing unit is detected to be operating normally in a subsequent detection period, the main control core resumes dispatching tasks to it.
Through periodic detection, a hardware processing unit that has recovered from an abnormality can be returned to service in time, which improves the task processing efficiency and the overall performance of the graphics processor.
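Steps 202 to 204 can be summarized as a loop over detection periods. The Python sketch below is only an illustration under stated assumptions: each period's test outcome is supplied as a dict from unit id to pass/fail, and `run_periods` and the unit ids are hypothetical names. It uses the flag convention given above (1 abnormal, 0 normal).

```python
def run_periods(period_results, unit_ids):
    """Return, for each period, the units that may receive tasks.

    period_results: one dict per detection period, mapping unit id to
    True when the unit's test case passed in that period.
    """
    flags = {uid: 0 for uid in unit_ids}  # 0 = normal, 1 = abnormal
    dispatchable = []
    for results in period_results:
        for uid in unit_ids:
            # Suspend on failure, resume as soon as a later period passes.
            flags[uid] = 0 if results[uid] else 1
        dispatchable.append([uid for uid in unit_ids if flags[uid] == 0])
    return dispatchable


history = run_periods(
    [{"A": True, "B": True},    # period 1: both normal
     {"A": True, "B": False},   # period 2: B abnormal, dispatch suspended
     {"A": True, "B": True}],   # period 3: B recovered, dispatch resumed
    ["A", "B"],
)
```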
In the technical scheme of the application, the main control core is controlled to use the test case data to periodically detect whether a hardware processing unit in the graphics processor is operating abnormally. If it is, the main control core is controlled to suspend task dispatch to that hardware processing unit, so the abnormality is found in time and no further tasks are sent to the abnormal unit. When the abnormal hardware processing unit is detected to have returned to normal operation in any subsequent period, task dispatch to it is resumed. This effectively avoids the blockage caused by an abnormal hardware processing unit while the graphics processor is running, returns the hardware processing unit to service promptly once it recovers, and improves the task processing efficiency of the graphics processor.
The method for detecting abnormal hardware in the graphics processor is used in the graphics processor, and the graphics processor comprises a timer, a hardware processing unit, a result comparator, a decision maker, a dispatcher and other related hardware units.
In the above graphics processor, the method for detecting abnormal hardware in the graphics processor in the present application may be roughly divided into the following steps:
step 1: initializing a GPU, initializing a timer, initializing all hardware processing units and initializing related hardware equipment.
The step 1 specifically comprises the following steps: initializing a GPU, initializing a timer, initializing all hardware processing units and initializing related hardware equipment. And the initialized GPU updates the state of the GPU to HOST, and the HOST accesses the state of the GPU through a PCI bus. The GPU which is successfully initialized updates the GPU state to be ACTIVE (1) in the data structure describing the state of the GPU, and the GPU which is unsuccessfully initialized is inactive (0).
The data type describing the GPU state is a pool type, and a register (GPU _ status register) associated with GPU _ status is defined in a register segment commonly accessed by the CPU and the GPU, and the GPU state is updated by the GPU _ status register. The CPU and GPU have readable and writable rights to the GPU _ status register. After the CPU and the GPU are successfully communicated, the CPU writes inactive into the GPU _ status register, and when the GPU master control core is successfully initialized, the GPU _ status register is updated to ACTIVE. And entering the next step until the CPU reads that the GPU _ status register state is ACTIVE, otherwise, judging that the GPU master control core is abnormal.
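The GPU_status handshake can be modelled in a few lines of Python. This is a simulation sketch, not driver code: the register is a plain object, and the class and function names as well as the bounded polling count `max_polls` are assumptions introduced for the example (the application does not state how long the CPU waits).

```python
INACTIVE, ACTIVE = False, True  # bool-typed state: 0 and 1 in the text


class GpuStatusRegister:
    """Stand-in for the GPU_status register shared by the CPU and the GPU."""

    def __init__(self):
        self.value = INACTIVE


def cpu_reset(reg):
    # CPU side: written once CPU-GPU communication is established.
    reg.value = INACTIVE


def gpu_master_core_init(reg, init_ok):
    # GPU side: only a successful initialization flips the register.
    if init_ok:
        reg.value = ACTIVE


def cpu_check(reg, max_polls=3):
    """True if ACTIVE is read within max_polls reads (assumed bound)."""
    for _ in range(max_polls):
        if reg.value == ACTIVE:
            return True
    return False  # main control core judged abnormal


good, bad = GpuStatusRegister(), GpuStatusRegister()
for reg, ok in ((good, True), (bad, False)):
    cpu_reset(reg)
    gpu_master_core_init(reg, ok)
```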
Step 2: judge whether the dynamic detection time condition of the hardware processing unit is met, and dispatch either rendering tasks or request tasks accordingly.
Step 2 specifically comprises the following: it is judged whether the dynamic detection time condition of the hardware processing unit is met. When the dynamic detection time condition is not met, the GPU main control core invokes the dispatcher, sends tasks to all hardware processing units, and manages the backup and release of task data. The dispatcher follows a low-power-consumption, idle-priority strategy and dispatches tasks according to the decision information of the decision maker. When the dynamic detection time condition is met, the GPU main control core acquires the address of the detection firmware and schedules it to the dispatcher preferentially; the dispatcher dispatches request tasks to all hardware processing units and, at the same time, forms a request task index list to inform the decision maker that the request tasks have been sent. If the previous task has not yet been completed, the request task is queued and is processed only after the previous task completes. The dispatcher releases the backup of the task data after the task is completed.
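As a non-limiting illustration, the decision of step 2 may be sketched as follows. The names `Dispatcher`, `submit_render`, `submit_detection` and `detection_due`, and the use of a fixed period as the dynamic detection time condition, are assumptions made for this sketch; the low-power-consumption, idle-priority strategy is not modelled.

```python
from collections import deque


def detection_due(now: float, last_check: float, period: float) -> bool:
    """Assumed form of the time condition: one detection every `period` seconds."""
    return now - last_check >= period


class Dispatcher:
    """Holds waiting tasks; detection request tasks go ahead of waiting render tasks."""

    def __init__(self, unit_ids):
        self.unit_ids = list(unit_ids)
        self.queue = deque()
        # Request task index list: unit id -> comparison result (None = pending).
        self.request_index = {}

    def submit_render(self, task):
        self.queue.append(("render", task))

    def submit_detection(self, test_case):
        # Preferential scheduling: placed at the head of the waiting queue.
        # A task already running on a unit is not interrupted, so the request
        # task still waits for that task to finish.
        self.queue.appendleft(("detect", test_case))
        # One pending entry per unit tells the decision maker what was sent.
        self.request_index = {uid: None for uid in self.unit_ids}
        return self.request_index
```

For example, with one render task already waiting, `submit_detection("1+2")` leaves `("detect", "1+2")` at the head of the queue and returns an index list with one pending entry per hardware processing unit.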
Step 3: the hardware processing unit processes the request task.
Step 3 specifically comprises the following: the hardware processing unit processes the request task. The request task enters the hardware processing unit, and the dispatcher sends the corresponding correct value to the result comparator. The hardware processing unit fetches the instruction address according to the request task, decodes the instruction code, executes the instruction, and outputs the result to the result comparator after execution. The request task can be a simple operation task or a complex one. The correct value corresponding to each request task is stored at a specified address or bound to the request task. When a request task becomes a dispatching object, the dispatcher looks up the corresponding position and selects the correct value matching that request task. The correct value may be a random number or a preselected number that rarely occurs, to avoid misjudgment (false negatives). For example, "1+2" is used as a request task and dispatched to hardware processing unit A; after hardware processing unit A finishes processing the request task, the actual value obtained is passed as the first input to the result comparator of hardware processing unit A. When "1+2" is the request task, the dispatcher finds the preset correct value "3" and dispatches it as the second input to the result comparator of hardware processing unit A.
Step 4: the result comparator compares the actual value with the correct value and outputs a comparison result.
Step 4 specifically comprises the following: the result comparator compares the actual value with the correct value and outputs a comparison result.
The comparison result is passed as input to the decision maker. The comparison result is either "0" or "1", where "0" indicates that the two values differ and "1" indicates that they are the same. For example, after hardware processing unit A performs the operation "1+2", the output actual value is "3"; compared with the correct value "3" in the result comparator, the values are equal, and the comparison result is "1". If the output actual value is not "3", the comparison with the correct value "3" yields "0". Some hardware processing units may hang due to high temperature or software problems and cannot process the request task; in that case the comparison result is also "0".
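As a non-limiting illustration, the behaviour of the result comparator in steps 3 and 4 may be sketched as follows. The function name `compare_result` and the use of `None` to model a hung unit that produces no output are assumptions of this sketch.

```python
def compare_result(actual, correct):
    """Return 1 when the actual value equals the correct value, otherwise 0.

    `actual` is the first input (from the hardware processing unit) and
    `correct` is the second input (from the dispatcher). `None` models a
    unit that hung and never produced a value.
    """
    if actual is None:
        return 0
    return 1 if actual == correct else 0


# The "1+2" request task from the description, with preset correct value 3.
unit_a_output = 1 + 2
result_for_unit_a = compare_result(unit_a_output, 3)
```

Here `result_for_unit_a` is 1; a unit that returned any other value, or nothing at all, would yield 0.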
Step 5: the decision maker receives the comparison results from the different hardware processing units and fills in the corresponding comparison results according to the request task index list notified to it by the dispatcher.
Step 5 specifically comprises the following: the decision maker collects the comparison results from the different hardware processing units and fills in the corresponding comparison results according to the request task index list notified to it by the dispatcher. The request task index list reflects the status of all hardware processing units and is accessible to both the dispatcher and the decision maker. The decision maker judges the state of each hardware processing unit according to the comparison result in the request task index list. A comparison result of "1" indicates that the hardware processing unit is normal; a comparison result of "0" indicates that the hardware processing unit is abnormal, in which case the decision maker marks the abnormal hardware processing unit and informs the dispatcher of its current state.
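As a non-limiting illustration, the filling of the request task index list in step 5 may be sketched as follows. The function name `decide` and the dictionary representation of the list are hypothetical; treating a unit that reported nothing as "0" is an assumption consistent with the handling of hung units described above.

```python
def decide(request_index, comparison_results):
    """Fill the request task index list and return the ids of abnormal units.

    request_index: dict mapping unit id -> None (pending), as issued by the dispatcher.
    comparison_results: dict mapping unit id -> 0 or 1, as received from the comparators.
    """
    abnormal = []
    for unit_id in request_index:
        # Assumption: no result received is treated the same as a mismatch.
        result = comparison_results.get(unit_id, 0)
        request_index[unit_id] = result
        if result == 0:
            abnormal.append(unit_id)  # marked abnormal and reported to the dispatcher
    return abnormal
```

For an index list covering units A, B and C and results `{"A": 1, "B": 0}`, the sketch marks B (wrong result) and C (no result) as abnormal.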
Step 6: the dispatcher handles task dispatching according to the state of the hardware processing units.
Step 6 specifically comprises the following: the dispatcher handles task dispatching according to the state of the hardware processing units. The dispatcher continues to dispatch tasks to normal hardware processing units. Depending on specific requirements, the task backup of an abnormal hardware processing unit is dispatched to other hardware processing units or discarded, and the abnormal hardware processing unit is then reset. The dispatcher disables any hardware processing unit marked as abnormal by the decision maker and does not dispatch tasks to it. The flow then returns to step 2: when the next detection request task arrives, whether the unit has recovered is judged from the actual comparison result, and a recovered unit is added to the work queue again and receives tasks. The decision maker collects the comparison results from the different hardware processing units, looks up the request task index of the hardware processing unit matching each result according to the information each unit transmits to the decision maker, converts the comparison result into the state of that unit, and updates the state in the request task index list. The decision maker thus records the latest state of every hardware processing unit in the list one by one, notifies the dispatcher to read the new request task index list after the update, and finally resets the hardware processing units whose state is abnormal. The dispatcher receives the update signal from the decision maker, reads the new request task index list, obtains the latest hardware processing unit states from it, and forms a new hardware processing unit dispatch catalog in the dispatcher.
The dispatcher then dispatches tasks selectively according to the rendering requirements: it continues to dispatch tasks to normal hardware processing units, transfers the tasks of abnormal hardware processing units elsewhere, and temporarily disables the abnormal units.
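As a non-limiting illustration, the dispatch catalog of step 6 may be sketched as follows. The class name `DispatchCatalog` and the first-available selection rule are assumptions of this sketch; the reset of abnormal units and the redistribution of task backups are not modelled.

```python
class DispatchCatalog:
    """Tracks which hardware processing units may currently receive tasks."""

    def __init__(self, unit_ids):
        self.enabled = {uid: True for uid in unit_ids}

    def update(self, request_index):
        # Rebuild the catalog from the latest request task index list:
        # 1 -> dispatchable, 0 -> suspended until a later detection succeeds.
        for uid, result in request_index.items():
            self.enabled[uid] = (result == 1)

    def pick_unit(self):
        # Assumed selection rule: the first enabled unit, or None if all are suspended.
        for uid, ok in self.enabled.items():
            if ok:
                return uid
        return None


catalog = DispatchCatalog(["A", "B"])
catalog.update({"A": 0, "B": 1})   # detection period 1: unit A is abnormal
first_pick = catalog.pick_unit()
catalog.update({"A": 1, "B": 1})   # detection period 2: unit A has recovered
second_pick = catalog.pick_unit()
```

In this usage `first_pick` is "B" while unit A is suspended, and `second_pick` is "A" once the next detection period shows that unit A has recovered.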
The detection firmware referred to below is the firmware used to test the various hardware processing units; it is produced by compilation on the CPU. Methods for loading the detection firmware in the application include, but are not limited to, the following three: PCI channel loading, JTAG channel loading and Flash power-on loading.
PCI channel loading: the PCI device and the HOST exchange data through the PCIe protocol, and the HOST can access the GDDR of the GPU through address mapping. The target address on the GPU side is mapped to the CPU side through the PCI space, so the driver can operate on the GPU-side target address. After the location and size of the firmware file are specified, the firmware is written to that address, either directly or through a DMA burst.
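As a non-limiting illustration, the direct-write variant of PCI channel loading may be sketched as follows. The mapped GDDR window is modelled as a plain `bytearray`; the function name `load_firmware` and the bounds check are assumptions of this sketch, and no real driver or PCIe interface is used.

```python
def load_firmware(mapped_gddr: bytearray, target_offset: int, firmware: bytes) -> int:
    """Copy the firmware image into the mapped window at target_offset.

    Returns the offset just past the last byte written. The window stands in
    for GPU memory that the HOST reaches through address mapping.
    """
    end = target_offset + len(firmware)
    if target_offset < 0 or end > len(mapped_gddr):
        raise ValueError("firmware does not fit in the mapped window")
    mapped_gddr[target_offset:end] = firmware
    return end
```

The firmware size determines how many bytes are written, which mirrors the requirement that the location and size of the firmware be specified before the write.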
JTAG channel loading: after the JTAG driver is installed, the HOST can monitor and operate the GDDR global address space through a USB-to-JTAG connection. To write through JTAG, the JTAG command is executed on the command line with the firmware path, the size and the GDDR target address specified; after the command is executed, the firmware is loaded from the HOST side into the specified GDDR address space through the JTAG channel.
Flash power-on loading: the firmware is burned into the Flash of the GPU with a programmer through a JTAG emulator. Once burned into the Flash, the firmware is retained even when power is removed. When the system is powered on again, the GPU can fetch instructions and data directly from the Flash.
Corresponding to the foregoing method embodiments, the application also provides a device for detecting abnormal hardware in a graphics processor, an electronic device and corresponding embodiments.
Fig. 3 is a schematic structural diagram of an apparatus for detecting abnormal hardware in a graphics processor according to an embodiment of the present application.
As shown in fig. 3, the apparatus 30 for detecting abnormal hardware in a graphics processor in the embodiment of the present application includes:
an initialization module 301, an abnormality detection module 302 and a task dispatching module 303;
the initialization module 301 is configured to: initialize the graphics processor when the graphics processor is powered on;
the anomaly detection module 302 is configured to: if the initialization is normal, control the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally, wherein the test case data is used for detecting the running state of the hardware processing unit;
the task dispatch module 303 is configured to: if the hardware processing unit runs abnormally, control the main control core to suspend task dispatching to the hardware processing unit;
the task dispatch module 303 is further configured to: when the hardware processing unit is detected to have returned to normal operation in any subsequent period, control the main control core to resume dispatching tasks to the hardware processing unit.
Optionally, in some embodiments, an exception flag is set for the hardware processing unit; the anomaly detection module 302 is further configured to: when the hardware processing unit runs abnormally, control the main control core to set the exception flag of the hardware processing unit to abnormal; and when the hardware processing unit returns to normal operation, control the main control core to set the exception flag of the hardware processing unit to normal.
Optionally, in some embodiments, the test case data includes: overall test case data and individual test case data. The overall test case data does not distinguish between hardware processing unit types and can test all types of hardware processing units. The individual test case data is designed for a particular type of hardware processing unit and can test only hardware processing units of that type; the individual test case data comprises: test case type C data, test case type B data and test case type A data.
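As a non-limiting illustration, the distinction between overall and individual test case data may be sketched as follows. The names `DetectionCase` and `applicable`, the field layout, and the sample tasks are hypothetical; only the type labels A, B and C come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectionCase:
    task: str                        # request task, e.g. "1+2"
    correct: int                     # preset correct value for the task
    unit_type: Optional[str] = None  # None = overall case; "A", "B" or "C" = individual case


def applicable(case: DetectionCase, unit_type: str) -> bool:
    """An overall case tests every unit type; an individual case tests only its own type."""
    return case.unit_type is None or case.unit_type == unit_type


overall_case = DetectionCase("1+2", 3)
type_a_case = DetectionCase("2*3", 6, unit_type="A")
```

With these definitions `overall_case` applies to units of every type, whereas `type_a_case` applies only to units of type A.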
Optionally, in some embodiments, the graphics processor further includes: a dispatcher, a decision maker and a result comparator; the anomaly detection module 302 is specifically configured to perform the following operations:
controlling the main control core to preferentially schedule the test case data to the dispatcher;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, send a request task index table corresponding to the request task to the decision maker, and send a task correct result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and send the obtained task processing result to the result comparator;
controlling the result comparator to compare the task correct result with the task processing result to obtain a comparison result, and send the comparison result to the decision maker;
and controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result and update the request task index table.
Further optionally, when the anomaly detection module 302 controls the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result, it specifically performs the following operations:
if the comparison result is "equal", the anomaly detection module 302 determines that the hardware processing unit runs normally;
if the comparison result is "not equal", the anomaly detection module 302 determines that the hardware processing unit runs abnormally.
Abnormal operation of the hardware processing unit covers the following two cases: 1. the hardware processing unit executes the task, but the task processing result obtained after processing is incorrect; 2. the hardware processing unit does not process the task before it times out, so the comparison result is "not equal".
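As a non-limiting illustration, the two cases of abnormal operation may be distinguished as sketched below. The function name `classify`, the `elapsed` and `timeout` parameters and the reason strings are assumptions of this sketch; the application states only that both cases lead to a comparison result of "not equal". Treating a result that arrives after the deadline as a timeout is likewise an assumption.

```python
def classify(actual, correct, elapsed: float, timeout: float):
    """Return (comparison_result, reason) for one request task.

    comparison_result is 1 for "equal" and 0 for "not equal".
    """
    if actual is None or elapsed > timeout:
        return (0, "timeout")        # case 2: the task was not processed in time
    if actual != correct:
        return (0, "wrong_result")   # case 1: processed, but the result is incorrect
    return (1, "normal")
```

Both failure reasons map to the same comparison result of 0, so the decision maker and the dispatcher need no change to handle either case.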
Optionally, in some embodiments, before the anomaly detection module 302 controls the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally, the anomaly detection module 302 is further configured to perform one of the following operations to obtain the test case data:
controlling the main control core to send a data request signal to the CPU of the host to acquire the test case data, wherein the test case data is stored in the CPU and can be dynamically called;
or controlling the main control core to obtain the memory address of the detection firmware, wherein the test case data is packed in the detection firmware, and the detection firmware is loaded into the memory of the graphics processor.
Optionally, in some embodiments, when the test case data is packed in the detection firmware, the ways of loading the detection firmware into the memory of the graphics processor include: PCI channel loading, JTAG channel loading and Flash power-on loading.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs operations and the advantages thereof have been described in detail in the embodiment related to the method, and will not be elaborated upon herein.
Fig. 4 is a schematic structural diagram of an electronic device in an embodiment of the present application.
As shown in fig. 4, the electronic device 40 in the embodiment of the present application includes a memory 401 and a processor 402. The memory has stored thereon executable code that, when executed by the processor, causes the processor to perform the method of any of the embodiments described above.
The Processor 402 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 401 may include various types of storage units, such as a system memory, a read-only memory (ROM), and a persistent storage device. The ROM may store static data or instructions required by the processor 402 or other modules of the computer. The persistent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, the persistent storage device is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the persistent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor requires at runtime. Further, the memory 401 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, the memory 401 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card), a magnetic floppy disk, or the like. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 401 has stored thereon executable code which, when processed by the processor 402, may cause the processor 402 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or an electronic device, a server, etc.), causes the processor to perform part or all of the steps of the above-described method according to the present application.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are intended only to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application.
The foregoing description of the embodiments of the present application has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for detecting abnormal hardware in a graphics processor, characterized in that the graphics processor comprises a main control core and a hardware processing unit; the detection method comprises the following steps:
when the graphics processor is powered on, initializing the graphics processor;
if the initialization is normal, controlling the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally, wherein the test case data is used for detecting the running state of the hardware processing unit;
if the hardware processing unit runs abnormally, controlling the main control core to suspend task dispatching to the hardware processing unit;
and when the hardware processing unit is detected to have returned to normal operation in any subsequent period, controlling the main control core to resume dispatching tasks to the hardware processing unit.
2. The detection method according to claim 1, wherein an exception flag is set for the hardware processing unit; the detection method further comprises the following steps:
when the hardware processing unit runs abnormally, controlling the main control core to mark an abnormal mark of the hardware processing unit as abnormal;
and when the hardware processing unit returns to normal operation, controlling the main control core to mark the abnormal mark of the hardware processing unit as normal.
3. The detection method according to claim 1, wherein the test case data comprises: overall test case data and individual test case data; the overall test case data does not distinguish between hardware processing unit types and can be used to test all types of the hardware processing unit; the individual test case data is designed for a particular type of hardware processing unit and can test only hardware processing units of that type.
4. The detection method according to claim 1, wherein the graphics processor further comprises: a dispatcher, a decision maker and a result comparator; the controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally comprises the following steps:
controlling the main control core to preferentially schedule the test case data to the dispatcher;
controlling the dispatcher to dispatch a request task to the hardware processing unit according to the test case data, sending a request task index table corresponding to the request task to the decision maker, and sending a task correct result corresponding to the request task to the result comparator;
controlling the hardware processing unit to process the request task according to the test case data and sending an obtained task processing result to the result comparator;
controlling the result comparator to compare the task correct result with the task processing result to obtain a comparison result, and sending the comparison result to the decision maker;
and controlling the decision maker to determine whether the hardware processing unit is abnormal or not according to the comparison result, and updating the request task index table.
5. The detection method according to claim 4, wherein the controlling the decision maker to determine whether the hardware processing unit is abnormal according to the comparison result comprises:
if the comparison results are equal, determining that the hardware processing unit operates normally;
and if the comparison results are not equal, determining that the hardware processing unit is abnormal in operation.
6. The detection method according to claim 4, wherein before controlling the main control core to use the test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally, the method further comprises:
controlling the main control core to send a data request signal to a CPU of a host to acquire the test case data, wherein the test case data is stored in the CPU and can be dynamically called;
or controlling the main control core to obtain a memory address of detection firmware, wherein the test case data is packed in the detection firmware, and the detection firmware is loaded into a memory of the graphics processor.
7. The detection method according to claim 6, wherein when the test case data is packed in the detection firmware, a loading manner in which the detection firmware is loaded into the memory of the graphics processor comprises: PCI channel loading, JTAG channel loading and Flash power-on loading.
8. A device for detecting abnormal hardware in a graphics processor, characterized in that the graphics processor comprises a main control core and a hardware processing unit; the detection device includes:
an initialization module, an anomaly detection module and a task dispatching module;
the initialization module is configured to: initialize the graphics processor when the graphics processor is powered on;
the anomaly detection module is configured to: if the initialization is normal, control the main control core to use test case data to periodically detect whether the hardware processing unit in the graphics processor runs abnormally, wherein the test case data is used for detecting the running state of the hardware processing unit;
the task dispatching module is configured to: if the hardware processing unit runs abnormally, control the main control core to suspend task dispatching to the hardware processing unit;
the task dispatching module is further configured to: when the hardware processing unit is detected to have returned to normal operation in any subsequent period, control the main control core to resume dispatching tasks to the hardware processing unit.
9. An electronic device, comprising:
a memory and a processor, wherein the memory has executable code stored thereon;
the executable code, when invoked by the processor, causes an electronic device to perform the steps in the method for detecting anomalous hardware in a graphics processor as claimed in any of claims 1 to 7.
10. A computer readable storage medium having stored thereon executable code which, when invoked by a processor of an electronic device, causes the electronic device to perform the steps in the method of detecting anomalous hardware in a graphics processor as claimed in any one of claims 1 to 7.
CN202310030933.4A 2023-01-10 2023-01-10 Method and device for detecting abnormal core in graphics processor and electronic equipment Active CN115827355B (en)

Publications (2)

Publication Number Publication Date
CN115827355A true CN115827355A (en) 2023-03-21
CN115827355B CN115827355B (en) 2023-04-28

Family

ID=85520523

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820837A (en) * 2023-06-28 2023-09-29 合芯科技有限公司 Exception handling method and device for system component

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236515A (en) * 2007-01-31 2008-08-06 迈普(四川)通信技术有限公司 Multi-core system single-core abnormity restoration method
US20130160016A1 (en) * 2011-12-16 2013-06-20 Advanced Micro Devices, Inc. Allocating Compute Kernels to Processors in a Heterogeneous System
US20170003707A1 (en) * 2013-08-28 2017-01-05 Via Technologies, Inc. Single-core wakeup multi-core synchronization mechanism
US10019576B1 (en) * 2015-04-06 2018-07-10 Intelligent Automation, Inc. Security control system for protection of multi-core processors
CN109815043A (en) * 2019-01-25 2019-05-28 华为技术有限公司 Fault handling method, relevant device and computer storage medium
CN110622162A (en) * 2017-05-10 2019-12-27 金德祐 Computer with independent user calculating part
US20210165730A1 (en) * 2021-02-12 2021-06-03 Intel Corporation Hardware reliability diagnostics and failure detection via parallel software computation and compare
CN113495857A (en) * 2020-03-20 2021-10-12 辉达公司 Memory error isolation techniques

Also Published As

Publication number Publication date
CN115827355B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN1882917B (en) Method and apparatus for monitoring and resetting a co-processor
US9449717B2 (en) Memory built-in self-test for a data processing apparatus
US8065492B2 (en) System and method for early detection of failure of a solid-state data storage system
US7730293B2 (en) Hard disk drive self-test system and method
US20060010282A1 (en) Method and apparatus to boot a system by monitoring an operating status of a NAND flash memory
US7650259B2 (en) Method for tuning chipset parameters to achieve optimal performance under varying workload types
TWI667588B (en) Computing device, method and machine readable storage media for detecting unauthorized memory accesses
US7809985B2 (en) Offline hardware diagnostic environment
US7555671B2 (en) Systems and methods for implementing reliability, availability and serviceability in a computer system
US20120042215A1 (en) Request processing system provided with multi-core processor
JP5038798B2 (en) Memory testing
CN115827355A (en) Detection method and detection device for abnormal core in graphic processor and electronic equipment
US7373493B2 (en) Boot methods, computer systems, and production methods thereof
US8065497B2 (en) Data management method, and storage apparatus and controller thereof
TW201346756A (en) Processor with second jump execution unit for branch misprediction
US10657003B2 (en) Partial backup during runtime for memory modules with volatile memory and non-volatile memory
JP2008511050A (en) Error response by data processing system and peripheral devices
US6971003B1 (en) Method and apparatus for minimizing option ROM BIOS code
JP2004302731A (en) Information processor and method for trouble diagnosis
WO1998013762A1 (en) Processing system and method for reading and restoring information in a ram configuration
EP1630668A1 (en) Boot method based on hibernation files for preventing unauthorized modifications
KR20050064262A (en) Method for initializing a plurality of devices using job-scheduler
CN101107591A (en) Computer system and method for activating basic program therein
JPH08272756A (en) Method for starting multiprocessor system
US20070179635A1 (en) Method and article of manufacure to persistently deconfigure connected elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant