CN115904850B - Power-on detection method of multi-core processor, readable storage medium and GPU - Google Patents

Power-on detection method of multi-core processor, readable storage medium and GPU

Info

Publication number
CN115904850B
Authority
CN
China
Prior art keywords
processor core
self
power
checking
abnormal
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310028815.XA
Other languages
Chinese (zh)
Other versions
CN115904850A (en)
Inventor
何睿
张坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Original Assignee
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Application filed by Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority to CN202310028815.XA
Publication of CN115904850A
Application granted
Publication of CN115904850B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application disclose a power-on detection method for a multi-core processor, a readable storage medium, and a GPU. The method comprises the following steps: after power-on, loading firmware into a local memory, starting and executing a local power-on self-test once the firmware is loaded, updating a local power-on self-test result, and, if an upper-level processor core exists, notifying it after the local power-on self-test is completed; if the local power-on self-test result indicates a normal start and the current processor core manages lower-level processor cores, controlling the managed lower-level processor cores to power on; after receiving a power-on self-test completion notification from a lower-level processor core, reading that core's power-on self-test result, updating, according to that result, the startup record information that represents the startup state of the managed lower-level processor cores, and, if an upper-level processor core exists, reporting to it once the update is complete. The method achieves hierarchical, layer-by-layer power-on self-test, startup-count statistics, and management of the processor cores.

Description

Power-on detection method of multi-core processor, readable storage medium and GPU
Technical Field
The application relates to the technical field of multi-core processors, and in particular to a power-on detection method for a multi-core processor, a readable storage medium, and a GPU.
Background
The most obvious difference between a multi-core processor and an ordinary processor is that a multi-core processor has hundreds to thousands of processor cores. Because these cores differ in function and performance, a multi-core processor also differs from an ordinary processor in architecture and performance. To confirm that the processor cores in a multi-core processor have started properly, each processor core requires a functional check and a system-wide check after power-up.
The prior art often adds extra monitoring subroutines to the multi-core processor to assist in power-up detection. However, these extra subroutines increase the complexity of the processor software design and are somewhat invasive to the processor.
Disclosure of Invention
The main purpose of the application is to provide a power-on detection method for a multi-core processor, a readable storage medium, and a GPU, which solve the technical problem in the prior art that adding an extra monitoring program increases the complexity of the processor software design and is invasive.
To achieve the above object, a first aspect of the present application provides a power-on detection method of a multi-core processor, including:
after power-on, loading the corresponding firmware into a local memory, starting and executing a local power-on self-test once the firmware is loaded, updating a local power-on self-test result, and, if an upper-level processor core exists, notifying the upper-level processor core after the local power-on self-test is completed;
if the local power-on self-test result indicates a normal start and the current processor core manages a lower-level processor core, controlling the managed lower-level processor core to power on;
after receiving a power-on self-test completion notification from the lower-level processor core, reading the power-on self-test result of the lower-level processor core, updating, according to that result, startup record information representing the startup state of the managed lower-level processor core, and, if an upper-level processor core exists, reporting to the upper-level processor core after the update is completed.
To achieve the above object, a second aspect of the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor core causes the processor core to perform:
after power-on, loading the corresponding firmware into a local memory, starting and executing a local power-on self-test once the firmware is loaded, updating a local power-on self-test result, and, if an upper-level processor core exists, notifying the upper-level processor core after the local power-on self-test is completed;
if the local power-on self-test result indicates a normal start and the current processor core manages a lower-level processor core, controlling the managed lower-level processor core to power on;
after receiving a power-on self-test completion notification from the lower-level processor core, reading the power-on self-test result of the lower-level processor core, updating, according to that result, startup record information representing the startup state of the managed lower-level processor core, and, if an upper-level processor core exists, reporting to the upper-level processor core after the update is completed.
To achieve the above object, a third aspect of the present application provides a GPU, where the GPU is integrated with a plurality of processor cores, and the power-up detection is performed by the processor cores according to the method of any one of the preceding claims when the GPU is powered up.
By adopting the embodiment of the application, the method has the following beneficial effects:
Without adding an extra monitoring program that would increase the complexity of the processor software design, each upper-level processor core controls its lower-level processor cores to power on, and the power-on self-test results of the lower-level processor cores are passed upward. In this way the numbers of processor cores that started normally and abnormally are counted, and hierarchical, layer-by-layer power-on self-test and management are achieved. This reduces both the invasiveness to the multi-core processor and the impact on the processor.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
FIG. 1 is an application environment diagram of a power-on detection method of a multi-core processor in an embodiment of the present application;
FIG. 2 is a flow chart of power-on detection of a multi-core processor in an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
FIG. 1 is an application environment diagram of a power-on detection method of a multi-core processor in one embodiment. Referring to FIG. 1, the first layer of the multi-core processor contains one highest-level processor core, namely first-layer processor core 1. First-layer processor core 1 manages three secondary processor cores: second-layer processor cores 1, 2 and 3. Second-layer processor core 1 manages two lower-level processor cores (third-layer processor cores 1 and 2), second-layer processor core 2 manages one (third-layer processor core 3), and second-layer processor core 3 manages three (third-layer processor cores 4, 5 and 6). A processor core at any level may act as the current processor core and execute the corresponding power-on detection method. The highest-level processor core has lower-level processor cores but no upper-level processor core, the six lowest-level processor cores have only upper-level processor cores, and the three secondary processor cores in the middle have both.
Of course, FIG. 1 is merely an example. In practice a multi-core processor may include hundreds or thousands of processor cores, whose power-up sequence and hierarchical relationships may be more complex than the tree structure shown in FIG. 1.
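The hierarchy of FIG. 1 can be written down as a small tree. The following is only an illustrative sketch: the `Core` class, its field names, and the short core labels such as `L2-1` are assumptions made for this example and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Core:
    """One processor core in the management tree (hypothetical model)."""
    name: str
    # repr/compare are disabled on parent to avoid recursing through the tree.
    parent: Optional["Core"] = field(default=None, repr=False, compare=False)
    children: List["Core"] = field(default_factory=list)

    def manage(self, *names: str) -> List["Core"]:
        """Attach lower-level cores managed by this core."""
        added = [Core(n, parent=self) for n in names]
        self.children.extend(added)
        return added


def build_fig1() -> Core:
    """Build the three-layer example described for FIG. 1."""
    top = Core("L1-1")
    l2_1, l2_2, l2_3 = top.manage("L2-1", "L2-2", "L2-3")
    l2_1.manage("L3-1", "L3-2")
    l2_2.manage("L3-3")
    l2_3.manage("L3-4", "L3-5", "L3-6")
    return top


def walk(core: Core):
    """Yield every core in the tree, upper levels first."""
    yield core
    for child in core.children:
        yield from walk(child)


root = build_fig1()
cores = list(walk(root))
leaves = [c for c in cores if not c.children]
middle = [c for c in cores if c.parent is not None and c.children]
print(len(cores), len(middle), len(leaves))  # 10 3 6
```

The three groups printed at the end correspond to the three roles described above: one core with only lower-level cores, three with both, and six with only an upper-level core.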
As shown in FIG. 2, in one embodiment, a power-on detection method for a multi-core processor is provided. The method is applied to a current processor core in the multi-core processor and specifically comprises the following steps:
S100: after power-on, loading the corresponding firmware into a local memory, starting and executing a local power-on self-test once the firmware is loaded, updating a local power-on self-test result, and, if an upper-level processor core exists, notifying the upper-level processor core after the local power-on self-test is completed.
Specifically, the multi-core processor includes a plurality of processor cores, and the processor cores are started in a defined order of priority levels. The highest-level processor core starts first and performs its power-on self-test. Once its self-test shows a normal start, it controls the next-level processor cores it manages, namely the secondary processor cores, to start. Once the self-test of a secondary processor core shows a normal start, that core in turn controls the lower-level processor cores it manages to start, and so on. Cascaded, level-by-level startup and power-on self-test are thus achieved by having each upper level control the level below it.
The multi-core processor of the present application may be a multi-core graphics processing unit (GPU), CPU, MCU, GPGPU, DPU, or the like, without limitation. The multi-core processor includes at least two levels of processor cores.
The current processor core is a processor core that has been powered on. At any given moment, the multi-core processor may have several current processor cores that are simultaneously performing their own power-on self-tests or controlling lower-level processor cores to power on and self-test.
If the current processor core is the highest-level processor core, its power-up is controlled externally. If the current processor core is at any level other than the highest level, which starts first, it is powered on by its upper-level processor core, which controls its power supply.
The current processor core is powered up and then loads firmware locally. If the current processor core is the highest level processor core, the current processor core loads the corresponding firmware from FLASH to the local memory.
If the current processor core is a non-highest level processor core, the current processor core loads the corresponding firmware from the Host PC to the local memory via PCIe.
After the firmware is loaded, the current processor core starts a local power-on self-test program to execute the local power-on self-test. During the self-test, the current processor core stores or updates the local power-on self-test result according to the actual self-test outcome. The power-on self-test is the self-test that runs first, immediately after power-on, and mainly checks the performance and state of hardware and/or software.
If the local self-test finds no abnormality, the current processor core writes the startup-correct code to the designated memory address as the final power-on self-test result. The corresponding register states may also be set at the same time.
If the current processor core is not the highest-level processor core, i.e. an upper-level processor core exists, the current processor core notifies the upper-level processor core after the local power-on self-test is completed. In this relationship, the current processor core is a lower-level processor core of its upper-level processor core.
After receiving the power-on self-test completion notification from the current processor core, the upper-level processor core reads the power-on self-test result of the current processor core and, according to that result, updates the startup record information representing the startup state of the lower-level processor cores it manages. If that upper-level processor core itself has an upper-level processor core, it reports to it after the update.
If the current processor core is the highest level processor core, i.e., there is no upper level processor core, then there is no need to notify the upper level processor core.
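Step S100 for a single core can be sketched as follows. This is a minimal illustration under stated assumptions: the value of `START_OK`, the function names, and the use of a dictionary and a list to stand in for the designated memory address and the inter-core notification are all hypothetical, and firmware loading is reduced to recording where the firmware came from.

```python
START_OK = 0x00  # hypothetical "startup-correct" code; the real value is not given


def firmware_source(has_upper: bool) -> str:
    """Pick the firmware source: FLASH for the highest-level core,
    the Host PC over PCIe for every other core."""
    return "host_pc_via_pcie" if has_upper else "flash"


def run_s100(core_id, has_upper, self_test, result_memory, notifications):
    """Sketch of S100: load firmware, self-test, record result, notify upper core.

    self_test is a callable returning a result code (START_OK when no
    abnormality is found); result_memory stands in for the designated
    memory address; notifications stands in for inter-core messaging.
    """
    local_memory = {"firmware_from": firmware_source(has_upper)}
    result_memory[core_id] = self_test()      # update the local self-test result
    if has_upper:
        notifications.append(core_id)         # tell the upper-level core we are done
    return local_memory


result_memory, notifications = {}, []
run_s100("L1-1", False, lambda: START_OK, result_memory, notifications)
run_s100("L2-1", True, lambda: START_OK, result_memory, notifications)
print(result_memory, notifications)  # {'L1-1': 0, 'L2-1': 0} ['L2-1']
```

Only the non-highest-level core produces a notification, matching the rule that the highest-level core has no upper-level core to notify.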
S200: if the local power-on self-test result indicates a normal start and the current processor core manages a lower-level processor core, controlling the managed lower-level processor core to power on.
Specifically, if the local power-on self-test result is the startup-correct code, the start is normal. If the current processor core manages lower-level processor cores, it then controls those cores to power on, so that each of them performs step S100 after powering on: loading the corresponding firmware into a local memory, starting and executing a local power-on self-test once the firmware is loaded, updating a local power-on self-test result, and, if an upper-level processor core exists, notifying it after the local power-on self-test is completed.
S300: after receiving a power-on self-test completion notification from the lower-level processor core, reading the power-on self-test result of the lower-level processor core, updating, according to that result, startup record information representing the startup state of the managed lower-level processor core, and, if an upper-level processor core exists, reporting to the upper-level processor core after the update is completed.
Specifically, after receiving the power-on self-test completion notification from a lower-level processor core, the current processor core reads that core's power-on self-test result from the designated memory address of the lower-level processor core.
The current processor core manages at least one lower-level processor core. Under normal conditions, it reads the power-on self-test result of each of them and updates the startup record information accordingly. The startup record information stores the power-on self-test results of the lower-level processor cores.
If the current processor core has an upper-level processor core, it notifies that core after updating the startup record information, so that the upper-level processor core can read the updated startup record information and the power-on self-test results are passed upward. The self-test results travel from the bottom up until the power-on self-test results of all secondary and lower-level processor cores reach the highest-level processor core.
In this embodiment, without adding an extra monitoring program that would increase the complexity of the processor software design, each upper-level processor core controls its lower-level processor cores to power on, and the power-on self-test results of the lower-level processor cores are passed upward. In this way the numbers of processor cores that started normally and abnormally are counted, and hierarchical, layer-by-layer power-on, self-test, and management are achieved. This reduces both the invasiveness to the multi-core processor and the impact on the processor.
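Steps S100 to S300 and the resulting startup counts can be simulated on the FIG. 1 tree. The sketch below is a simplification, not the claimed implementation: it uses synchronous recursion in place of power control and inter-core notifications, and the class, the attribute names, `START_OK`, and the failure code `0x21` are all assumptions chosen for illustration.

```python
START_OK = 0  # hypothetical startup-correct code


class Core:
    """Simulated processor core; all names here are illustrative."""

    def __init__(self, name, post_code=START_OK):
        self.name = name
        self._post_code = post_code   # what the simulated self-test will yield
        self.post_result = None       # stands in for the designated memory address
        self.children = []            # managed lower-level cores
        self.startup_record = {}      # child name -> child's self-test result
        self.normal = 0               # cores in this subtree that started normally
        self.abnormal = 0             # cores in this subtree that started abnormally

    def add(self, child):
        self.children.append(child)
        return child

    def power_on(self):
        # S100: firmware loading is omitted; run the self-test and store the result.
        self.post_result = self._post_code
        started = self.post_result == START_OK
        self.normal, self.abnormal = (1, 0) if started else (0, 1)
        if started:
            for child in self.children:
                child.power_on()      # S200: power on a managed lower-level core
                # S300: read the child's result, update the record and the counts.
                self.startup_record[child.name] = child.post_result
                self.normal += child.normal
                self.abnormal += child.abnormal
        return self.normal, self.abnormal


top = Core("L1-1")
l2_1 = top.add(Core("L2-1"))
l2_2 = top.add(Core("L2-2", post_code=0x21))  # simulated startup failure
l2_3 = top.add(Core("L2-3"))
for n in ("L3-1", "L3-2"):
    l2_1.add(Core(n))
l3_3 = l2_2.add(Core("L3-3"))
for n in ("L3-4", "L3-5", "L3-6"):
    l2_3.add(Core(n))

print(top.power_on())  # (8, 1): L2-2 failed, so L3-3 was never powered on
```

Because a core that fails its own self-test never powers on the cores below it, those cores appear in neither count in this sketch; how such cores should be accounted for is not specified in the text.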
In one embodiment, after the local power-on self-test is completed, the method further comprises:
if the local power-on self-test result indicates a normal start, then, upon receiving a functional self-test task dispatched by the upper-level processor core, starting the functional self-test program to execute the received task, and notifying the upper-level processor core after the functional self-test result is obtained;
After reading the power-on self-test result of the lower-level processor core, the method further comprises:
if the power-on self-test result of the lower-level processor core indicates a normal start, dispatching a corresponding functional self-test task to the lower-level processor core;
after receiving a functional self-test completion notification from the lower-level processor core, reading the functional self-test result of the lower-level processor core, checking whether that result is correct, updating, according to the outcome of the check, function record information representing whether the functions of the managed lower-level processor core are normal, and reporting to the upper-level processor core after the function record information is updated.
Specifically, functional self-test tasks are dispatched only to processor cores whose power-on self-test is normal. A processor core whose power-on self-test is abnormal is powered off and receives no functional self-test task.
The functional self-test task of each processor core whose power-on self-test is normal is dispatched by its upper-level processor core. The highest-level processor core dispatches its functional self-test task to itself.
The functional self-test task includes a graphics computing task, and the functional self-test program includes a graphics computing self-test program.
After receiving a functional self-test task dispatched by the upper-level processor core, the current processor core starts the functional self-test program, which is launched by the power-on self-test program. Once in the functional self-test program, it executes the received task and reports to the upper-level processor core after obtaining the functional self-test result.
After the power-on self-test result of a lower-level processor core indicates a normal start, the current processor core dispatches the corresponding functional self-test task to that core. After receiving the task, the lower-level processor core starts the functional self-test program to execute it and notifies the current processor core after obtaining the functional self-test result.
After receiving the functional self-test completion notification from the lower-level processor core, the current processor core reads that core's functional self-test result, checks whether it is correct, updates the function record information accordingly, and reports the updated function record information to its own upper-level processor core, thereby passing the functional self-test results of the lower-level processor cores it manages upward.
In this embodiment, a functional self-test task is dispatched after a processor core passes its power-on self-test, so that the core goes on to perform a functional self-test. This provides a double check, power-on self-test plus functional self-test, and further ensures that a processor core is working normally before it enters its normal program.
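The dispatch-and-verify pattern of the functional self-test can be sketched as follows. The text only states that the task includes a graphics computing task; the small matrix-vector product used here, the function names, and the dictionary-based function record are assumptions for illustration.

```python
START_OK = 0  # hypothetical startup-correct code


def make_task():
    """A stand-in functional self-test task with a known expected result."""
    matrix = [[1, 2], [3, 4]]
    vector = [5, 6]
    expected = [17, 39]  # 1*5 + 2*6 and 3*5 + 4*6
    return (matrix, vector), expected


def healthy_core(task):
    """Simulated lower-level core whose compute units work."""
    matrix, vector = task
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def faulty_core(task):
    """Simulated lower-level core that returns a wrong result."""
    matrix, _ = task
    return [0 for _ in matrix]


def dispatch_and_check(lower_cores, post_results):
    """Dispatch tasks only to cores whose power-on self-test was normal,
    verify each returned result, and build the function record."""
    function_record = {}
    for name, run in lower_cores.items():
        if post_results[name] != START_OK:
            continue                      # abnormal cores receive no task
        task, expected = make_task()
        function_record[name] = (run(task) == expected)
    return function_record


lower_cores = {"L3-1": healthy_core, "L3-2": faulty_core, "L3-3": healthy_core}
post_results = {"L3-1": START_OK, "L3-2": START_OK, "L3-3": 0x21}
record = dispatch_and_check(lower_cores, post_results)
print(record)  # {'L3-1': True, 'L3-2': False}
```

Here `L3-3` is skipped because its power-on self-test was abnormal, while `L3-2` passes the power-on self-test but is caught by the functional check.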
In one embodiment, starting and performing the local power-on self test in step S100 includes:
if an exception event occurs during the local power-on self-test, suspending the power-on self-test, recording the self-test progress, switching to an abnormal working mode, and starting and executing self-repair;
if the self-repair fails and an upper-level processor core exists, storing the exception code corresponding to the exception that could not be repaired as the local power-on self-test result, and then reporting the exception to the upper-level processor core to request that it repair the local exception;
if the self-repair succeeds, or the exception repair by the upper-level processor core succeeds, exiting the abnormal working mode and continuing the local power-on self-test from the recorded self-test progress.
Specifically, an exception event may occur during the power-on self-test of any processor core. Exception events, i.e. exception types, may include exceptions defined by the processor's instruction set architecture, such as divide-by-zero exceptions, and memory management exceptions, such as page faults.
When a processor core detects an exception event of its own, it suspends the power-on self-test and records the self-test progress, i.e. saves the context, and then enters the abnormal working mode. In the abnormal working mode it first attempts self-repair. If self-repair is unsuccessful, it may request the upper-level processor core to help repair the exception.
Before requesting help from the upper-level processor core, the current processor core classifies the exception event that it cannot repair itself to obtain an exception code, and stores the exception code at the designated memory address as the power-on self-test result. The upper-level processor core then repairs the current processor core according to that exception code.
If the current processor core is the highest-level processor core, it has no upper-level processor core, so no other processor core can be powered on after its self-repair fails. In this case, after several failed self-repair attempts, the highest-level processor core may issue a repair failure warning and/or display the exception cause or exception code to prompt maintenance personnel to repair it.
In another embodiment, the multi-core processor includes, for example, two highest-level processor cores, i.e. a dual main processor core scheme. The two highest-level processor cores start simultaneously and back each other up. After each of them completes its power-on self-test, it records the result at its preset memory address. When an exception event occurs in either highest-level processor core, the other is notified through an inter-core communication mechanism and handles the exception event of the abnormal main processor core. Handling methods include, but are not limited to, performing a hardware restart of the abnormal processor core or shutting off its power supply.
If the current processor core repairs itself successfully, or the exception is repaired with the help of the upper-level processor core, the current processor core exits the abnormal working mode, re-enters the power-on self-test program, restores the context, and continues the local power-on self-test from the point where it was suspended.
In addition, a higher-level processor core can discover and promptly handle exception events of its lower-level processor cores by actively querying them. This avoids the situation in which a communication failure in a lower-level processor core prevents the higher-level processor core from repairing that core's abnormal state in time.
This embodiment repairs processor cores at power-on in two ways: self-repair and repair assisted by the upper-level core on request. It effectively addresses power-on exceptions of processor cores and, to a certain extent, ensures that the processor cores can start successfully.
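The suspend, repair, and resume behaviour can be sketched as a loop that keeps a progress index across repair attempts. This is an illustrative sketch only: `SelfTestError`, `run_post`, the callback signatures, the code `0x10`, and the `max_repairs` bound (added here so the loop always terminates) are assumptions, not details from the application.

```python
START_OK = 0  # hypothetical startup-correct code


class SelfTestError(Exception):
    """Raised by a self-test step; carries a hypothetical exception code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def run_post(steps, self_repair, request_upper_repair=None, max_repairs=3):
    """Run self-test steps in order. On an exception event the saved
    progress is kept, self-repair is tried first, then the upper-level
    core is asked for help. Returns (result_code, progress)."""
    progress = 0                         # recorded self-test progress
    repairs = 0
    while progress < len(steps):
        try:
            steps[progress]()
            progress += 1                # step passed, move on
        except SelfTestError as err:
            # Abnormal working mode: progress is preserved while repairing.
            repairs += 1
            if repairs > max_repairs:
                return err.code, progress
            if self_repair(err.code):
                continue                 # resume from the saved progress
            if request_upper_repair is not None and request_upper_repair(err.code):
                continue                 # upper-level core repaired the exception
            return err.code, progress    # exception code becomes the result
    return START_OK, progress


state = {"mem_ok": False}


def check_registers():
    pass                                 # always passes in this simulation


def check_memory():
    if not state["mem_ok"]:
        raise SelfTestError(0x10)


def self_repair(code):
    return False                         # self-repair cannot fix this fault


def upper_repair(code):
    state["mem_ok"] = True               # the upper-level core fixes the fault
    return True


result = run_post([check_registers, check_memory], self_repair, upper_repair)
print(result)  # (0, 2)
```

Passing `request_upper_repair=None` models the highest-level core, which has nobody to escalate to and therefore ends with the exception code as its result.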
In one embodiment, after powering up the lower processor core and before receiving the power-up self-test completion notification of the lower processor core, the method further comprises:
if exception information from the lower-level processor core is received, reading the exception code of the lower-level processor core and then performing exception repair on the lower-level processor core;
if the exception repair fails, determining whether the number of exception repairs exceeds a first threshold, and if it does not, performing exception repair on the lower-level processor core again;
if the first threshold is exceeded, powering off the lower-level processor core;
if an upper-level processor core exists, reporting to it the exception handling state of the powered-off lower-level processor core.
Specifically, when the current processor core has a lower-level processor core and receives exception information from it during that core's power-on self-test, the current processor core reads the power-on self-test result from the designated memory address of the lower-level processor core. At this point the result is an intermediate one that holds the exception code. After reading the exception code, the current processor core actively performs exception repair on the lower-level processor core and then notifies it to continue its power-on self-test. If the same exception event occurs again when the lower-level processor core continues its self-test, it notifies the current processor core again, and the current processor core repairs the exception event again. Each repair performed by the current processor core increments the exception repair count. This repeats until the exception repair count exceeds the first threshold, at which point the current processor core stops repairing and directly switches off the power supply of the lower-level processor core. The current processor core then marks the lower-level processor core as a core that started abnormally. Further, after switching off the power supply of the abnormal lower-level processor core, the current processor core writes the shutdown code of that core to the designated memory address to update the startup record information.
If the current processor core has an upper-level processor core, it reports to that core after updating the startup record information, thereby passing on the exception handling state of the powered-off lower-level processor core.
In this embodiment, each lower-level processor core uses its abnormal working mode to actively classify exception events and notifies its upper-level processor core through an inter-core communication mechanism. Each upper-level processor core counts how many of the lower-level processor cores it manages started normally and abnormally, and the highest-level processor core finally counts all processor cores that started normally and abnormally.
In this embodiment, the current processor core makes multiple attempts to repair an exception in a lower-level processor core and cuts the power of a lower-level processor core that is still not repaired once the number of attempts exceeds the first threshold. This gives the upper-level processor core the ability to repair its lower-level processor cores while preventing it from wasting resources in an endless repair loop, balancing repair against resource use.
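The bounded repair loop can be sketched as below. The values of `FIRST_THRESHOLD` and `POWERED_OFF`, the function names, and the callback signatures are hypothetical. The sketch reads "exceeds the first threshold" literally, so with a threshold of 3 the core is powered off after the fourth failed repair; other readings of the boundary are possible.

```python
FIRST_THRESHOLD = 3   # hypothetical value; the text only requires an integer >= 1
POWERED_OFF = 0xFF    # hypothetical shutdown code written to the startup record


def supervise_lower_core(run_lower_post, repair, first_threshold=FIRST_THRESHOLD):
    """Repair a lower-level core until its self-test passes or the repair
    count exceeds the threshold, in which case the core is powered off.

    run_lower_post() returns 0 on a normal start, otherwise an exception code.
    repair(code) performs one repair attempt on the lower-level core.
    Returns (recorded_code, repair_count).
    """
    repair_count = 0
    code = run_lower_post()
    while code != 0:
        repair(code)
        repair_count += 1                # accumulate the repair count
        code = run_lower_post()          # lower-level core continues its self-test
        if code != 0 and repair_count > first_threshold:
            return POWERED_OFF, repair_count   # give up and cut the power
    return 0, repair_count


def make_core(fails_until):
    """Simulated lower-level core that passes after `fails_until` repairs."""
    state = {"repairs": 0}

    def post():
        return 0 if state["repairs"] >= fails_until else 0x10

    def repair(code):
        state["repairs"] += 1

    return post, repair


print(supervise_lower_core(*make_core(2)))    # (0, 2)
print(supervise_lower_core(*make_core(99)))   # (255, 4)
```

The first simulated core recovers after two repairs; the second never recovers and is powered off once the count passes the threshold, which keeps the upper-level core from looping forever.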
In one embodiment, exception repair for a lower processor core includes:
performing a first firmware check on the lower-level processor core; if the first firmware check passes, resetting the lower-level processor core, which reloads its firmware after the reset and then executes the power-on self-test;
if the first firmware check fails, controlling the lower-level processor core to reload the firmware and then execute the power-on self-test;
incrementing the number of exception repairs;
if the lower-level processor core still raises an exception event after reloading the firmware, determining that the exception repair has failed, and if it raises no exception event after reloading the firmware, determining that the exception repair has succeeded.
Specifically, an abnormal event in a processor core may be caused by corrupted firmware content, a hardware fault, or a design error in the firmware itself.
In this embodiment, the exception repair that the current processor core performs on the lower processor core begins with a first firmware check on the lower processor core's firmware, to determine whether the abnormal event was caused by corrupted firmware content. If the first firmware check passes, the firmware of the lower processor core is intact. The current processor core then resets the lower processor core, using a hardware reset, a software reset, or a similar method. After the reset, the lower processor core automatically reloads its firmware, resumes normal operation, and performs its local power-on self-test.
If the first firmware check fails, the firmware of the lower processor core is faulty. The current processor core then controls the lower processor core to reload the firmware and perform the power-on self-test.
If the power-on self-test still generates an abnormal event after the lower processor core has reloaded its firmware, the exception repair by the current processor core has failed; if no abnormal event occurs, the exception repair has succeeded.
If the exception repair fails and the accumulated number of exception repairs does not exceed the first threshold, the current processor core performs exception repair on the lower processor core again.
If the number of exception repairs exceeds the first threshold, the lower processor core is determined to have a hardware fault or a firmware design error. The current processor core cannot repair such a fault, so it powers off the lower processor core directly.
The first threshold is a positive integer (at least 1). It is a software-defined value that can be adjusted as needed and is set when the software of the current processor core is compiled.
Resetting the lower processor core and reloading its firmware in this way constitutes the exception repair of the lower processor core.
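The repair loop described above can be sketched as follows. This is a minimal simulation under stated assumptions: `LowerCore`, its fields, the value of `FIRST_THRESHOLD`, and the idea that a reload always restores intact firmware are all hypothetical, and the "exceeds the first threshold" test is taken literally as `repairs > threshold`.

```python
from dataclasses import dataclass

FIRST_THRESHOLD = 3  # assumed value; the text only requires a positive integer


@dataclass
class LowerCore:
    """Hypothetical stand-in for a managed lower processor core."""
    firmware_ok: bool            # outcome of the first firmware check
    faults_remaining: int        # how many more self-tests will raise an abnormal event
    powered: bool = True
    resets: int = 0
    reloads: int = 0

    def reset(self) -> None:
        self.resets += 1
        self.reloads += 1        # firmware is reloaded automatically after a reset

    def reload_firmware(self) -> None:
        self.reloads += 1
        self.firmware_ok = True  # assumption: a reload restores intact firmware

    def power_on_self_test(self) -> bool:
        """Return True when the self-test completes without an abnormal event."""
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            return False
        return True


def repair_lower_core(core: LowerCore, threshold: int = FIRST_THRESHOLD):
    """Repeat exception repair until it succeeds or the count exceeds the threshold."""
    repairs = 0
    while True:
        if core.firmware_ok:
            core.reset()             # firmware intact: reset the core
        else:
            core.reload_firmware()   # firmware corrupted: reload it
        repairs += 1                 # accumulate the number of exception repairs
        if core.power_on_self_test():
            return "repaired", repairs
        if repairs > threshold:
            core.powered = False     # treated as a hardware or firmware design fault
            return "powered_off", repairs
```

A core whose fault clears after one retry is reported as repaired, while a core that keeps failing is powered off once the repair count passes the threshold.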
In another embodiment, the processor core is configured with a watchdog timer (WDT) module, which can handle some abnormal events by restarting the core. After the processor core is powered on, its software starts the watchdog timer in the WDT module, and when the count reaches the value set by software, the WDT module issues a reset signal that restarts the processor core. While the processor core is operating normally, it must clear the watchdog count in time to avoid a spurious restart. When an abnormal event prevents the processor core from operating normally, the count is no longer cleared, reaches its maximum value, and triggers the watchdog to issue a processor reset signal. This mechanism can be used to reset the processor core, or to let it reset itself, during self-repair.
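The watchdog behaviour can be illustrated with a small software model. The class and method names (`WatchdogTimer`, `tick`, `kick`) and the tick-based timing are assumptions made for illustration; a real WDT is a hardware counter driven by a clock.

```python
class WatchdogTimer:
    """Toy model of a watchdog counter with a software-set limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit          # count at which a reset signal is issued
        self.count = 0
        self.reset_signals = 0

    def tick(self) -> bool:
        """Advance the counter by one; return True if a reset signal fires."""
        self.count += 1
        if self.count >= self.limit:
            self.reset_signals += 1
            self.count = 0          # the counter restarts after the reset
            return True
        return False

    def kick(self) -> None:
        """Clear the count, as a normally running core does periodically."""
        self.count = 0


def run(wdt: WatchdogTimer, ticks: int, kick_every: int) -> int:
    """Simulate `ticks` timer ticks; kick_every=0 models a hung core."""
    for t in range(1, ticks + 1):
        if kick_every and t % kick_every == 0:
            wdt.kick()
        wdt.tick()
    return wdt.reset_signals
```

With a limit of 5, a core that clears the count every 3 ticks never triggers a reset, whereas a hung core is reset every 5 ticks.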
In one embodiment, after controlling the lower processor core to power on, the method further comprises: timing the power-on self-test of the lower processor core;
before receiving the power-on self-test completion notification from the lower processor core, the method further comprises: if the timed duration exceeds a preset duration and the power-on self-test completion notification of the lower processor core has not been received, reading the power-on self-test result of the lower processor core from the designated memory address of the lower processor core, and/or determining that the lower processor core has started abnormally.
Specifically, a processor core may, because of an exception, fail to send its inter-core communication signal. To avoid long, futile waits, a timeout mechanism may be provided in the processor core.
After powering on a lower processor core that it manages, the current processor core starts a timer to time that core's power-on self-test. If the timed duration exceeds the preset duration, that is, the timer expires, and the current processor core has still not received the power-on self-test completion notification from the lower processor core, the current processor core generates a timeout event and actively reads the power-on self-test result from the designated memory address of the lower processor core, or directly determines that the lower processor core has started abnormally. The current processor core may also power off a lower processor core judged to have started abnormally, write a shutdown code to the designated memory address, and report the exception handling state to its upper processor core through the inter-core communication mechanism.
By providing the timeout mechanism, this embodiment avoids the situation in which a power-on self-test result cannot be reported in time because inter-core communication is abnormal, so that the upper processor core obtains the power-on self-test results of its lower processor cores promptly.
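One way to read the timeout handling is sketched below. The function `await_post`, the `BOOT_ABNORMAL` marker, and the use of poll ticks in place of a hardware timer are assumptions; the sketch also picks one of the allowed options, reading the designated memory address on timeout and judging the start abnormal only if nothing is stored there.

```python
from typing import Callable, Optional

BOOT_ABNORMAL = "boot_abnormal"  # hypothetical marker for an abnormal start


def await_post(notified: Callable[[], bool],
               read_result: Callable[[], Optional[str]],
               timeout_ticks: int) -> str:
    """Wait for a lower core's self-test notification, with a timeout.

    `notified` polls for the completion notification and `read_result` reads
    the designated memory address (returning None if nothing is stored).
    """
    for _ in range(timeout_ticks):
        if notified():
            break                    # notification arrived before the timeout
    # Either the notification arrived or the timer expired: in both cases the
    # result is read from the designated memory address.
    result = read_result()
    return result if result is not None else BOOT_ABNORMAL
```

If the notification never arrives and no result has been written, the lower core is judged to have started abnormally; if a result is present in memory despite the lost notification, it is still picked up.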
In one embodiment, after notifying the upper processor core once the functional self-test result has been obtained, the method further comprises:
if a normal start instruction is received, switching from the functional self-test program to the normal program, wherein the normal start instruction is sent by the highest-level processor core, via the upper processor core, after the functional self-test result of the current processor core has been determined to be correct;
before reporting to the upper processor core after updating the function record information, the method further comprises:
if the functional self-test result of the lower processor core is wrong, determining whether the number of task dispatches exceeds a second threshold; if the second threshold is not exceeded, dispatching the corresponding functional self-test task to the lower processor core again;
if the second threshold is exceeded, controlling the lower processor core to end the functional self-test and/or power off;
reporting to the upper processor core after updating the function record information comprises: if the second threshold is exceeded, or the functional self-test result of the lower processor core is correct, updating the function record information and then reporting to the upper processor core.
Specifically, the functional self-test results are passed from the bottom up to the highest-level processor core. The highest-level processor core can count all processor cores together with the numbers whose power-on self-test was normal or abnormal, and likewise the numbers whose functional self-test was normal or abnormal.
After determining which processor cores passed the functional self-test, the highest-level processor core sends a normal start instruction to its lower processor cores, and the instruction is passed down layer by layer. Every processor core that passed the functional self-test receives the normal start instruction and switches from the functional self-test program to the normal program, leaving self-test and entering normal operation.
To reduce misjudgment, when the current processor core finds that the functional self-test result of a lower processor core is wrong, it dispatches the functional self-test task to that lower processor core again so that it retries the functional self-test. If the accumulated number of task dispatches exceeds the second threshold, the functional self-test of the lower processor core is finally determined to be abnormal, and the lower processor core is controlled to power off or to end the functional self-test.
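The re-dispatch rule can be sketched as a bounded retry loop. The names `run_function_self_test` and `SECOND_THRESHOLD`, the threshold value, and the comparison of the result against a known expected value are assumptions; the text does not say how the correctness of a result is checked.

```python
from typing import Callable, Tuple

SECOND_THRESHOLD = 2  # assumed value; the text does not give one


def run_function_self_test(dispatch: Callable[[], int],
                           expected: int,
                           threshold: int = SECOND_THRESHOLD) -> Tuple[bool, int]:
    """Dispatch a functional self-test task until its result is correct
    or the number of dispatches exceeds the threshold.

    `dispatch` stands for sending the task to the lower core and reading
    back its result.
    """
    dispatches = 0
    while True:
        result = dispatch()
        dispatches += 1
        if result == expected:
            return True, dispatches      # functional self-test is normal
        if dispatches > threshold:
            return False, dispatches     # finally judged abnormal
```

A transient wrong answer followed by a correct one is accepted on the second dispatch, while a core that always answers wrongly is given up on after the dispatch count passes the threshold.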
In one embodiment, performing the local power-on self-test includes at least one of:
checking the state of the local registers,
checking the read-write function of the local registers,
checking the read-write function of the local memory,
checking the pin state of each local pin.
Specifically, during the power-on self-test each processor core can check its own register state, register read-write capability, memory read-write capability and pin state one by one. The power-on self-test ensures that the processor cores finally made available are in good working condition and can be used for subsequent processing tasks.
Of course, the content of the power-on self-test is not limited to checking register state, pin state, memory operations and the like.
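Two of the listed checks can be sketched in software. The helper names, the alternating `0x55`/`0xAA` test patterns, and the `StuckBitMemory` fault model are illustrative assumptions; the patent does not specify how the checks are carried out.

```python
def check_state(actual: dict, expected: dict) -> bool:
    """State check: every named register or pin holds its expected value."""
    return all(actual.get(name) == value for name, value in expected.items())


def check_read_write(mem, patterns=(0x55, 0xAA)) -> bool:
    """Read-write check: write each pattern to every cell and read it back.

    The original contents of a cell are restored only when the cell passes.
    """
    for addr in range(len(mem)):
        saved = mem[addr]
        for pattern in patterns:
            mem[addr] = pattern
            if mem[addr] != pattern:
                return False
        mem[addr] = saved
    return True


class StuckBitMemory:
    """Hypothetical faulty memory whose bit 0 reads as 0 at one address."""

    def __init__(self, size: int, stuck_addr: int) -> None:
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr

    def __len__(self) -> int:
        return len(self.cells)

    def __getitem__(self, addr: int) -> int:
        value = self.cells[addr]
        return value & 0xFE if addr == self.stuck_addr else value

    def __setitem__(self, addr: int, value: int) -> None:
        self.cells[addr] = value
```

A plain `bytearray` passes the read-write check, while `StuckBitMemory` fails it because the `0x55` pattern cannot be read back at the faulty address.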
In one embodiment, after controlling the lower processor core to power on, the method further comprises:
performing a second firmware check on the firmware loaded by the lower processor core, and, if the second firmware check fails, controlling the lower processor core to reload the firmware until the second firmware check passes.
Specifically, each processor core loads its corresponding firmware into local memory after being powered on. To find abnormal processor cores as early as possible, the current processor core may perform a second firmware check on the firmware loaded by each lower processor core it manages. If the second firmware check fails, it controls the lower processor core to reload the firmware and then performs the second firmware check on the reloaded firmware, repeating this loop until the second firmware check passes.
Alternatively, the number of times the lower processor core reloads the firmware may be counted; if the count exceeds a third threshold, the lower processor core is determined to have failed and is powered off as early as possible.
A lower processor core that passes the second firmware check is allowed to enter the power-on self-test program and perform its power-on self-test.
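The reload-until-verified loop with a reload limit can be sketched as follows. The use of a CRC-32 comparison as the firmware check is an assumption (the text does not name a check algorithm), as are the names `verify_loaded_firmware` and `THIRD_THRESHOLD` and the threshold value.

```python
import zlib
from typing import Callable, Tuple

THIRD_THRESHOLD = 3  # assumed value; the text does not give one


def verify_loaded_firmware(load_firmware: Callable[[], bytes],
                           expected_crc: int,
                           threshold: int = THIRD_THRESHOLD) -> Tuple[str, int]:
    """Check the loaded image and reload it while the check fails.

    `load_firmware` stands for the lower core loading its image and returns
    the bytes that ended up in local memory.
    """
    image = load_firmware()          # initial load after power-on
    reloads = 0
    while True:
        if zlib.crc32(image) == expected_crc:
            return "verified", reloads       # may proceed to power-on self-test
        if reloads > threshold:
            return "powered_off", reloads    # core is judged to have failed
        image = load_firmware()              # reload and check again
        reloads += 1
```

The reference CRC would be computed from a known-good image, for example `zlib.crc32(good_image)`.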
In one embodiment, the processor cores communicate via an inter-core communication mechanism. The processor cores may use one or more mechanisms at the same time, including but not limited to processor pin interrupts, serial communication, shared memory and semaphores. Taking interrupt plus shared memory as an example, inter-core communication between processor core A and processor core B proceeds as follows:
Processor core A requests a lock on the shared memory; processor core A modifies the memory data according to its own power-on self-test result or functional self-test result, or according to the power-on self-test result or functional self-test result it has obtained from a lower processor core, and then releases the lock; processor core A sends an interrupt notification to processor core B; and processor core B, after entering its interrupt handler, reads the data from the shared memory. In this way the power-on self-test result can be shared or reported.
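The interrupt plus shared memory exchange can be imitated with threads, using a `threading.Lock` for the shared-memory lock and a `threading.Event` in place of the interrupt line. This is an analogy only: the `SharedMailbox` class and both function names are hypothetical, and real cores would use hardware interrupts and physical shared memory.

```python
import threading
from typing import Optional


class SharedMailbox:
    """Stand-in for a shared memory region plus an interrupt line."""

    def __init__(self) -> None:
        self.lock = threading.Lock()        # lock guarding the shared memory
        self.data: dict = {}                # the shared memory contents
        self.interrupt = threading.Event()  # models the interrupt to core B


def core_a_report(mailbox: SharedMailbox, core_id: str, result: str) -> None:
    """Core A: lock, write the self-test result, unlock, raise the interrupt."""
    with mailbox.lock:
        mailbox.data[core_id] = result
    mailbox.interrupt.set()


def core_b_interrupt_handler(mailbox: SharedMailbox,
                             timeout: float = 1.0) -> Optional[dict]:
    """Core B: on interrupt, read a snapshot of the shared memory.

    Returns None if no interrupt arrives within `timeout` seconds.
    """
    if not mailbox.interrupt.wait(timeout):
        return None
    mailbox.interrupt.clear()
    with mailbox.lock:
        return dict(mailbox.data)
```

Running `core_a_report` in one thread and `core_b_interrupt_handler` in another delivers the stored result to core B only after core A has released the lock.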
In one embodiment, the multi-core processor is connected through a PCIe connector to a host PC, that is, the motherboard of the computer device. The host PC includes a PCIe controller for managing and controlling the PCIe connector, and DDR memory for storing the firmware of the processor cores.
The main processor core loads its firmware from FLASH, and the other processor cores obtain their firmware from the DDR memory of the host PC through the PCIe connector.
PCIe (PCI Express, Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard. It was proposed by Intel in 2001 to replace the older PCI, PCI-X and AGP bus standards. The formal PCIe specifications are published by the PCI-SIG (PCI Special Interest Group); PCIe 4.0 and PCIe 5.0 are among the formally released versions. The interface connects high-speed components such as video cards, Wi-Fi cards, sound cards and even SSDs to the host PC. PCIe slots come in different physical configurations depending on the number of bidirectional lanes connected to them: x1, x4, x8, x16, x32.
The PCIe initialization process is as follows:
After the system is powered on, the host bridge device and the root bus are located; all sub-bridges and PCI devices under the root bus are traversed; all PCI devices under the sub-bridges are traversed; the BAR spaces of all devices are initialized according to the total system resources; and the register spaces of all bridge devices are initialized according to the total system resources.
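The traversal and BAR initialization can be pictured with a toy in-memory model. Everything here is a simplification made for illustration: the `Bridge` and `Device` classes and the function `enumerate_bus` are hypothetical, the traversal is a single depth-first pass, and real enumeration reads and writes PCI configuration space rather than Python objects. The only property carried over from real hardware is that each BAR is placed at an address aligned to its power-of-two size.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Device:
    name: str
    bar_sizes: List[int]                        # each size is a power of two
    bar_bases: List[int] = field(default_factory=list)


@dataclass
class Bridge:
    name: str
    children: List[Union["Bridge", Device]] = field(default_factory=list)


def enumerate_bus(bridge: Bridge, next_base: int, found: List[str]) -> int:
    """Walk the bus tree depth-first and assign a base address to each BAR.

    Returns the first free address after everything below `bridge`.
    """
    for child in bridge.children:
        if isinstance(child, Bridge):
            next_base = enumerate_bus(child, next_base, found)
        else:
            for size in child.bar_sizes:
                # round up to a multiple of the BAR size (natural alignment)
                next_base = (next_base + size - 1) & ~(size - 1)
                child.bar_bases.append(next_base)
                next_base += size
            found.append(child.name)
    return next_base
```

Starting from a root bridge and a base address taken from the system's available address range, the function records every device found and leaves each device with one base address per BAR.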
The present application also provides a computer readable storage medium storing a computer program which, when executed by a processor core, causes the processor core to perform the steps of the power-on detection method of the multi-core processor according to any one of the above embodiments.
The present application solves the problem of power-on self-test for multi-core processors such as graphics processors: it can count the processor cores that started normally and those that started abnormally, and confirm that the processor cores can run a normal graphics computation program. Graphics computation self-test tasks are processed by the processor core program, and the processor cores that start normally and abnormally are counted, thereby implementing the power-on self-test.
Through the processor core program, the multi-core processor checks register state, memory operations and pin state, counts the processor cores that start normally and abnormally, and thereby implements the power-on self-test. Graphics computation self-test tasks are processed by the processor core program and the processor cores that start normally and abnormally are counted, implementing the functional self-test. Abnormal events during power-on are handled by the processor core program, and the power-on self-test results and functional self-test results are reported with the help of the inter-core communication mechanism. Instruction-level exception handling can itself deal with some abnormal events, such as divide-by-zero events and memory page faults. Exception handling for a lower-level processor core is completed by a higher-level processor core, and the combination of self-repair and requested repair is achieved by powering off, restarting through a soft reset, restarting after reflashing the firmware, and similar means.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor core, causes the processor core to perform the following steps:
after power-on, loading the corresponding firmware into local memory, starting and executing the local power-on self-test once the firmware is loaded, updating the local power-on self-test result, and, if an upper processor core exists, notifying the upper processor core after the local power-on self-test is completed;
if the local power-on self-test result indicates a normal start and the current processor core manages lower processor cores, controlling the managed lower processor cores to power on;
after receiving the power-on self-test completion notification from a lower processor core, reading the power-on self-test result of the lower processor core, updating, according to that result, the startup record information that represents the startup state of the managed lower processor cores, and, if an upper processor core exists, reporting to the upper processor core after the update is completed.
The present application also provides a GPU integrating a plurality of processor cores, wherein the processor cores perform power-on detection according to any one of the methods described above when the GPU is powered on and started.
The graphics processor (GPU) processes the power-on self-test tasks and the graphics computation self-test tasks through the processor core program, counts the processor cores that start normally and abnormally, and thereby implements power-on detection.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.
The foregoing examples represent only a few embodiments of the present application. Although they are described in considerable detail, they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (11)

1. A power-on detection method of a multi-core processor, applied to a current processor core, the method comprising:
after power-on, loading corresponding firmware into a local memory, starting and executing a local power-on self-test after the firmware is loaded, updating a local power-on self-test result, and, if an upper processor core exists, notifying the upper processor core after the local power-on self-test is completed;
if the local power-on self-test result indicates a normal start and the current processor core manages a lower processor core, controlling the managed lower processor core to power on;
after the lower processor core is powered on and before a power-on self-test completion notification of the lower processor core is received, if an exception code of the lower processor core is read after exception information of the lower processor core is received, performing exception repair on the lower processor core; if the exception repair fails, determining whether the number of exception repairs exceeds a first threshold, and if the first threshold is not exceeded, performing exception repair on the lower processor core again; if the first threshold is exceeded, powering off the lower processor core; if the upper processor core exists, reporting an exception handling state of the powered-off lower processor core to the upper processor core;
and, if the upper processor core exists, reporting to the upper processor core after updating is completed.
2. The method of claim 1, wherein after the local power-on self-test is completed, the method further comprises:
if a functional self-test task dispatched by the upper processor core is received after the local power-on self-test result indicates a normal start, starting a functional self-test program to execute the received functional self-test task, and notifying the upper processor core after a functional self-test result is obtained;
after reading the power-on self-test result of the lower processor core, the method further comprises:
if the power-on self-test result of the lower processor core indicates a normal start, dispatching a corresponding functional self-test task to the lower processor core;
and after receiving a functional self-test completion notification of the lower processor core, reading the functional self-test result of the lower processor core, checking whether the functional self-test result of the lower processor core is correct, updating, according to the check result, function record information representing whether the managed lower processor core functions normally, and reporting to the upper processor core after updating the function record information.
3. The method of claim 1, wherein the starting and executing a local power-on self-test comprises:
if an abnormal event occurs during the local power-on self-test, stopping the power-on self-test, recording the self-test progress, switching to an abnormal working mode, and starting and executing self-repair;
if the self-repair fails and the upper processor core exists, storing an exception code corresponding to the exception whose repair failed as the local power-on self-test result, and reporting the exception to the upper processor core to request the upper processor core to perform exception repair on the local exception;
if the self-repair or the exception repair succeeds, exiting the abnormal working mode and continuing to execute the local power-on self-test according to the self-test progress.
4. The method of claim 1, wherein if the current processor core includes at least two highest-level processor cores and the highest-level processor cores are powered on and perform the power-on self-test simultaneously, the starting and executing the local power-on self-test comprises:
if an abnormal event occurs during the local power-on self-test, stopping the power-on self-test, recording the self-test progress, switching to an abnormal working mode, notifying the other highest-level processor cores through an inter-core communication mechanism, and requesting another highest-level processor core in which no abnormal event has occurred to perform exception repair on the local, abnormal highest-level processor core;
if the exception repair by the other highest-level processor core in which no abnormal event has occurred succeeds, exiting the abnormal working mode and continuing to execute the local power-on self-test according to the self-test progress.
5. The method of claim 1, wherein performing exception repair on the lower processor core comprises:
performing a first firmware check on the lower processor core; if the first firmware check passes, resetting the lower processor core, which reloads the firmware after the reset and then performs the power-on self-test;
if the first firmware check fails, controlling the lower processor core to reload the firmware and then perform the power-on self-test;
accumulating the number of exception repairs;
if the lower processor core still generates an abnormal event after reloading the firmware, determining that the exception repair has failed, and if the lower processor core does not generate an abnormal event after reloading the firmware, determining that the exception repair has succeeded.
6. The method of claim 1, wherein after controlling the lower processor core to power on, the method further comprises: timing the power-on self-test of the lower processor core;
before receiving the power-on self-test completion notification of the lower processor core, the method further comprises: if the timed duration exceeds a preset duration and the power-on self-test completion notification of the lower processor core has not been received, reading the power-on self-test result of the lower processor core from the designated memory address of the lower processor core, and/or determining that the lower processor core has started abnormally.
7. The method of claim 2, wherein after notifying the upper processor core once the functional self-test result has been obtained, the method further comprises:
if a normal start instruction is received, switching from the functional self-test program to a normal program, wherein the normal start instruction is sent by a highest-level processor core, via the upper processor core, after the functional self-test result of the current processor core has been determined to be correct;
before reporting to the upper processor core after updating the function record information, the method further comprises:
if the functional self-test result of the lower processor core is wrong, determining whether the number of task dispatches exceeds a second threshold, and if the second threshold is not exceeded, dispatching the corresponding functional self-test task to the lower processor core again;
if the second threshold is exceeded, controlling the lower processor core to end the functional self-test and/or power off;
and reporting to the upper processor core after updating the function record information comprises: if the second threshold is exceeded, or the functional self-test result of the lower processor core is correct, updating the function record information and then reporting to the upper processor core.
8. The method of claim 1, wherein performing the local power-on self-test comprises at least one of:
checking the state of the local registers,
checking the read-write function of the local registers,
checking the read-write function of the local memory,
checking the pin state of each local pin.
9. The method of claim 1, wherein after controlling the lower processor core to power on, the method further comprises:
performing a second firmware check on the firmware loaded by the lower processor core, and, if the second firmware check fails, controlling the lower processor core to reload the firmware until the second firmware check passes.
10. A computer readable storage medium storing a computer program which, when executed by a processor core in a multi-core processor, causes the processor core to perform the steps of the method according to any one of claims 1 to 9.
11. A GPU integrating a plurality of processor cores, wherein the processor cores perform power-on detection according to the method of any one of claims 1 to 9 when the GPU is powered on and started.
CN202310028815.XA 2023-01-09 2023-01-09 Power-on detection method of multi-core processor, readable storage medium and GPU Active CN115904850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310028815.XA CN115904850B (en) 2023-01-09 2023-01-09 Power-on detection method of multi-core processor, readable storage medium and GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310028815.XA CN115904850B (en) 2023-01-09 2023-01-09 Power-on detection method of multi-core processor, readable storage medium and GPU

Publications (2)

Publication Number Publication Date
CN115904850A CN115904850A (en) 2023-04-04
CN115904850B true CN115904850B (en) 2023-05-12

Family

ID=85748278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310028815.XA Active CN115904850B (en) 2023-01-09 2023-01-09 Power-on detection method of multi-core processor, readable storage medium and GPU

Country Status (1)

Country Link
CN (1) CN115904850B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996087A (en) * 2010-12-02 2011-03-30 北京星河亮点通信软件有限责任公司 Dynamical loading system and method for multi-core processor array program
CN102880536A (en) * 2012-09-07 2013-01-16 杭州中天微系统有限公司 JTAG (joint test action group) debug method of multi-core processor
CN115048258A (en) * 2021-03-09 2022-09-13 超聚变数字技术有限公司 Monitoring method and monitoring device for processor load

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461502A (en) * 2014-11-03 2015-03-25 广州汇讯营销咨询有限公司 Task management method and system based on Hadoop
DE102020209228A1 (en) * 2020-07-22 2022-01-27 Robert Bosch Gesellschaft mit beschränkter Haftung Method for monitoring at least one computing unit
CN115469912B (en) * 2022-11-02 2023-01-24 中国人民解放军国防科技大学 Heterogeneous real-time information processing system design method

Also Published As

Publication number Publication date
CN115904850A (en) 2023-04-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant