CN111858196A - Computing unit detection method, parallel processor and electronic equipment

Info

Publication number: CN111858196A
Application number: CN202010540012.9A
Authority: CN
Inventors: 袁庆
Original assignee: Haiguang Information Technology Co Ltd
Current assignee: Haiguang Information Technology Co Ltd
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2020-10-30

Abstract

The application relates to a computing unit detection method, a parallel processor and electronic equipment, and belongs to the technical field of computers. The method is applied to a parallel processor which comprises a computing unit group and a detection unit, and comprises the following steps: respectively inputting thread data corresponding to the same instruction into a detected computing unit and a detection unit for operation; wherein the detected computing unit is a computing unit in the computing unit group; and detecting the detected computing unit according to the output result of the detecting unit and the output result of the detected computing unit to obtain a detection result. In the embodiment of the application, the detection unit is added to hardware to detect the calculation unit in the calculation unit group, so that physical damage of the calculation unit can be detected in the software running process, and meanwhile, when the detection is carried out, the input data of the detected calculation unit is the same as the data of the detection unit, and the reliability of the detection result is ensured.

Description

Computing unit detection method, parallel processor and electronic equipment

Technical Field

The application belongs to the technical field of computers, and particularly relates to a computing unit detection method, a parallel processor and electronic equipment.

Background

With the development of Artificial Intelligence (AI), big data, etc., higher requirements are put on the computing power of the processor. The advent of parallel processors, such as Graphics Processing Units (GPUs) and various AI parallel processors, just meets the needs in this regard. Generally speaking, there are hundreds or thousands of parallel computing units in a parallel processor, which are independent of each other and cooperate to jointly complete distributed computing tasks. However, because of the numerous computing units of the parallel processor, the number of chip transistors is usually tens of times that of a Central Processing Unit (CPU), and the probability of physical MOS transistor damage is also increased proportionally. If such problems are found in the testing stage, the yield is only reduced to a certain extent, and no great loss is caused. However, if MOS damage occurs during the operation of the parallel processor, the calculation result of the whole calculation unit may be erroneous, and further the whole calculation may be erroneous, and the erroneous behavior can only be found in the next self-checking process.

A common solution to such problems is to perform a self-test. The self-checking is that self-checking logic is added at the idle stage of chip power-on or program operation to complete the check of the computing unit and ensure the normal work of the computing unit. However, the self-checking strategy cannot check physical damage occurring in the normal operation process of the software, and cannot correct the problem.

Disclosure of Invention

In view of this, an object of the present application is to provide a method for detecting a computing unit, a parallel processor, and an electronic device, so as to solve the problem that the existing solution cannot detect physical damage of the computing unit during the software running process.

The embodiment of the application is realized as follows:

in a first aspect, an embodiment of the present application provides a computing unit detection method, which is applied to a parallel processor, where the parallel processor includes a computing unit group and a detection unit, and the method includes: respectively inputting thread data corresponding to the same instruction into a detected computing unit and the detection unit for operation, wherein the detected computing unit is a computing unit in the computing unit group; and detecting the detected computing unit according to the output result of the detecting unit and the output result of the detected computing unit to obtain a detection result. In the embodiment of the application, the detection unit is added to hardware to detect the calculation unit in the calculation unit group, so that physical damage of the calculation unit can be detected in the software running process, and meanwhile, when the detection is carried out, the input data of the detected calculation unit is the same as the data of the detection unit, and the reliability of the detection result is ensured.

With reference to a possible implementation manner of the embodiment of the first aspect, when the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detection unit, the method further includes: and pausing input of a thread corresponding to the instruction after the moment of the instruction, and determining whether the detected computing unit or the detecting unit is abnormal. In the embodiment of the application, when the output result of the detected computing unit represented by the detection result is inconsistent with the output result of the detection unit, the thread operation after the moment of the current thread is suspended, so that more operation errors are reduced, and meanwhile, whether the detected computing unit or the detection unit is abnormal is further determined, so that different strategies are adopted for correction.

With reference to one possible implementation manner of the embodiment of the first aspect, the determining whether the detected computing unit or the detection unit is abnormal includes: respectively inputting the thread data for which the detection result was inconsistent into each computing unit in the computing unit group and into the detection unit for operation; and determining whether the detected computing unit or the detection unit is abnormal according to the output result of each computing unit and the output result of the detection unit, wherein when the output results of more than half of the computing units are consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal. In the embodiment of the application, when determining whether the detected computing unit or the detection unit is abnormal, each computing unit in the computing unit group and the detection unit operate on the same thread data, and if the output results of more than half of the computing units are consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal, which ensures the accuracy of the judgment.
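The majority rule described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware logic of the application; the function name `diagnose` and the returned labels are hypothetical.

```python
from typing import Sequence


def diagnose(compute_outputs: Sequence[int], detector_output: int) -> str:
    """Classify the fault after every unit has run the same thread data.

    compute_outputs holds one output per computing unit in the group;
    detector_output is the output of the detection unit.
    """
    # Count the computing units whose output matches the detection unit.
    agree = sum(1 for out in compute_outputs if out == detector_output)
    # "More than half" is a strict majority of the computing units.
    if 2 * agree > len(compute_outputs):
        return "detected_unit_abnormal"
    # Otherwise the detection unit itself is the suspect (hypothetical label).
    return "detection_unit_suspect"
```

For example, with 16 computing units of which 15 agree with the detection unit, the sketch blames the detected computing unit; an exact 8/8 split is not a strict majority and therefore does not.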

With reference to one possible implementation manner of the embodiment of the first aspect, the determining whether the detected computing unit or the detecting unit is abnormal includes: respectively inputting the corresponding thread data into each computing unit in the computing unit group and the detection unit to carry out operation when the detection results are inconsistent; when the output results of more than half of the computing units are consistent with the output result of the detection unit, respectively inputting the thread data into each computing unit in the computing unit group and the detection unit again for operation; and determining whether the detected computing unit or the detecting unit is abnormal according to the output result of each computing unit which is operated again and the output result of the detecting unit which is operated again, wherein if the output result of more than half of the computing units which are operated again is consistent with the output result of the detecting unit, the detected computing unit is determined to be abnormal. In the embodiment of the application, when determining whether the detected computing unit or the detecting unit is abnormal, the same thread data is respectively input into each computing unit and the detecting unit in the computing unit group for operation, if more than half of the output results of the computing units are consistent with the output results of the detecting unit, the thread data is respectively input into each computing unit in the computing unit group and the detecting unit again for operation, and when the result of the operation again is still more than half of the output results of the computing units and the output results of the detecting unit are consistent, the detected computing unit is determined to be abnormal, so that error influence caused by operation errors of a software program is eliminated, and the accuracy of the determination result is further improved.
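The two-pass confirmation can be modeled as follows. This is a minimal sketch under the assumption that a callable (here the hypothetical `run_all`) re-executes the same thread data on every unit and returns the computing-unit outputs together with the detection-unit output.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical callback: runs the thread data on all units and returns
# (outputs of the computing units, output of the detection unit).
RunAll = Callable[[], Tuple[List[int], int]]


def majority_agrees(outputs: Sequence[int], detector_output: int) -> bool:
    """True when more than half of the computing units match the detection unit."""
    return 2 * sum(o == detector_output for o in outputs) > len(outputs)


def confirm_detected_unit_fault(run_all: RunAll) -> bool:
    """Blame the detected unit only if the majority agrees in two consecutive runs."""
    first_outputs, first_detector = run_all()
    if not majority_agrees(first_outputs, first_detector):
        return False
    # Second run with the same thread data, to rule out a one-off error.
    second_outputs, second_detector = run_all()
    return majority_agrees(second_outputs, second_detector)
```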

With reference to one possible implementation manner of the embodiment of the first aspect, determining whether the detected computing unit or the detecting unit is abnormal further includes: when more than half of the output results of the computing units are inconsistent with the output results of the detection units, updating the times of inconsistent output results, wherein if the more than half of the output results of the computing units are consistent with the output results of the detection units, the times are cleared; and when the updated times are equal to a preset threshold value, determining that the detection unit is abnormal. In the embodiment of the application, when more than half of the output results of the calculation units are inconsistent with the output results of the detection units, the detection units are not directly judged to be abnormal, the times of continuous inconsistency of the detection units are counted, and the detection units are determined to be abnormal only when the times of inconsistency are equal to the preset threshold value, so that the error influence caused by running errors of software programs is eliminated, and the accuracy of the judgment results is further improved.
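The consecutive-mismatch counter can be illustrated with a small class. The class name `DetectorFaultCounter` and its interface are assumptions made for this sketch; the application only states that the count is updated on a mismatch, cleared on a match, and compared against a preset threshold.

```python
class DetectorFaultCounter:
    """Counts consecutive rounds in which the majority disagrees with the detection unit."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.count = 0

    def update(self, majority_agrees: bool) -> bool:
        """Record one round; return True when the detection unit is declared abnormal."""
        if majority_agrees:
            # A consistent round clears the count.
            self.count = 0
            return False
        self.count += 1
        return self.count == self.threshold
```

With a threshold of 3, two mismatches followed by a match reset the count, so three further consecutive mismatches are needed before the detection unit is declared abnormal.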

With reference to one possible implementation manner of the embodiment of the first aspect, after determining that the detected computing unit is abnormal, the method further includes: setting the detected computing unit to be in an unavailable state, and backtracking to a time node of a thread needing to be recalculated; and redistributing the threads corresponding to all the instructions which need to be recalculated after the time node to other calculating units in the calculating unit group except the detected calculating unit for operation. In the embodiment of the application, after the detected computing unit is determined to be abnormal, the detected computing unit is set to be in an unavailable state, a time node of a thread needing to be recalculated is traced back, and the threads corresponding to all instructions needing to be recalculated after the time node are redistributed to other computing units in the computing unit group except the detected computing unit for operation, so that operation errors caused by the abnormal detected computing unit are eliminated.

With reference to one possible implementation manner of the embodiment of the first aspect, after determining that the detected computing unit is abnormal, the method further includes: setting the detected computing unit to be in an unavailable state, setting the detecting unit to be a computing unit to replace the detected computing unit, and tracing back to a time node of a thread to be recalculated; and redistributing the threads corresponding to all the instructions which need to be recalculated after the time node to other calculating units except the detected calculating unit in the calculating unit group and the detecting unit for operation. In the embodiment of the application, after the detected computing unit is determined to be abnormal, the detected computing unit is set to be in an unavailable state, the detecting unit is set as the computing unit to replace the detected computing unit, a time node needing to recalculate a thread is traced back, the threads corresponding to all instructions needing to be recalculated after the time node are redistributed to other computing units and detecting units in the computing unit group except the detected computing unit for operation, so that operation errors caused by the abnormal detected computing unit are eliminated, and meanwhile, the number of the original computing units is unchanged because the abnormal detected computing unit is replaced by the detecting unit, and operation logic does not need to be changed.
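The difference between the two recovery variants (excluding the faulty unit only, or also substituting the detection unit for it) can be sketched as a thread redistribution routine. The function `reallocate` and its parameters are hypothetical; the sketch assumes units are identified by index and that threads are issued phase by phase.

```python
from typing import Dict, List


def reallocate(threads: List[int], num_units: int, faulty: int,
               detector: int, replace_with_detector: bool) -> List[Dict[int, int]]:
    """Redistribute the threads to recompute over the units that remain usable.

    Returns one {unit_index: thread_id} mapping per phase.
    """
    usable = [u for u in range(num_units)
              if u != faulty and (replace_with_detector or u != detector)]
    if not usable:
        raise ValueError("no usable unit left")
    phases = []
    # Each phase issues at most one thread per usable unit.
    for start in range(0, len(threads), len(usable)):
        chunk = threads[start:start + len(usable)]
        phases.append(dict(zip(usable, chunk)))
    return phases
```

With 17 units, a faulty MAC3 and MAC16 as the detection unit, substituting the detection unit keeps 16 usable units, so 64 threads still fit in 4 phases; without the substitution only 15 units remain and 5 phases are needed.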

With reference to a possible implementation manner of the embodiment of the first aspect, when detecting the computing units in the computing unit group, polling detection is performed on each computing unit in the computing unit group according to a preset order. In the embodiment of the application, when the computing units in the computing unit group are detected, each computing unit in the computing unit group is detected in a polling mode, so that the computing units with physical damage can be found in time, and meanwhile, the control logic is simplified.
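A polling order of this kind can be modeled as a simple round-robin iterator. This sketch assumes ascending index order as the preset order and uses the hypothetical name `polling_order`; the application does not fix a particular order.

```python
from itertools import cycle, islice
from typing import AbstractSet, Iterator


def polling_order(num_units: int, detector: int,
                  unavailable: AbstractSet[int] = frozenset()) -> Iterator[int]:
    """Endlessly yield the indices of the units to check, in ascending order.

    The detection unit and any unit marked unavailable are skipped.
    """
    candidates = [u for u in range(num_units)
                  if u != detector and u not in unavailable]
    return cycle(candidates)


# Usage: the first 18 checks with 17 units and unit 16 as the detection unit.
first_checks = list(islice(polling_order(17, 16), 18))
```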

In a second aspect, embodiments of the present application further provide a parallel processor including a SIMD architecture, the SIMD architecture including: a computing unit group; a detection unit; a control unit; a scheduling unit; and an arbitration unit. The scheduling unit is used for respectively inputting, under the control of the control unit, the thread data corresponding to the same instruction into a detected computing unit and the detection unit for operation, wherein the detected computing unit is a computing unit in the computing unit group; and the arbitration unit is used for detecting, under the control of the control unit, the detected computing unit according to the output result of the detection unit and the output result of the detected computing unit to obtain a detection result.

With reference to a possible implementation manner of the embodiment of the second aspect, the control unit is further configured to, when the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detection unit, suspend input of a thread corresponding to an instruction after the time when the instruction is located, and determine whether the detected computing unit or the detection unit is abnormal.

With reference to a possible implementation manner of the embodiment of the second aspect, the control unit is further configured to control the scheduling unit to input, to each computing unit in the computing unit group and the detecting unit, the corresponding thread data when the detection results are inconsistent, so as to perform an operation; and further for determining whether the detected computing unit or the detecting unit is abnormal, based on an output result of each computing unit and an output result of the detecting unit, wherein the detected computing unit is determined to be abnormal when more than half of the output results of the computing units are identical to the output result of the detecting unit.

With reference to a possible implementation manner of the embodiment of the second aspect, the control unit is further configured to control the scheduling unit to input the thread data for which the detection result was inconsistent into each computing unit in the computing unit group and into the detection unit for operation; when the output results of more than half of the computing units are consistent with the output result of the detection unit, to control the scheduling unit to input the thread data into each computing unit in the computing unit group and into the detection unit again for operation; and to determine whether the detected computing unit or the detection unit is abnormal according to the output result of each computing unit after the repeated operation and the output result of the detection unit after the repeated operation, wherein if the output results of more than half of the computing units after the repeated operation are consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal.

With reference to a possible implementation manner of the embodiment of the second aspect, the control unit is further configured to update the number of times the output results are inconsistent when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, where the number of times is cleared if the output results of more than half of the computing units are consistent with the output result of the detection unit; and further configured to determine that the detection unit is abnormal when the updated number of times is equal to a preset threshold.

In combination with one possible implementation manner of the embodiment of the second aspect, the SIMD architecture further includes: a storage unit for storing an input instruction block; the control unit is also used for setting the detected computing unit to be in an unavailable state after determining that the detected computing unit is abnormal, and backtracking to a time node needing to recalculate the thread; and reallocating threads corresponding to all the instructions which need to be recalculated after the time node to other computing units in the computing unit group except the detected computing unit for operation based on the instruction block stored in the storage unit.

In combination with one possible implementation manner of the embodiment of the second aspect, the SIMD architecture further includes: a storage unit for storing an input instruction block; the control unit is also used for setting the detected computing unit to an unavailable state after determining that the detected computing unit is abnormal, setting the detection unit as a computing unit to replace the detected computing unit, and tracing back to the time node from which threads need to be recalculated; and, based on the instruction block stored in the storage unit, reallocating the threads corresponding to all the instructions that need to be recalculated after the time node to the detection unit and the other computing units in the computing unit group except the detected computing unit for operation.

In a third aspect, an embodiment of the present application further provides an electronic device, including the parallel processor provided by the embodiment of the second aspect and/or by any possible implementation manner of the embodiment of the second aspect.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. The above and other objects, features and advantages of the present application will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.

Fig. 1 shows a schematic diagram of a current parallel processor.

Fig. 2 shows a schematic diagram of the operation of MAC in the current SIMD architecture.

Fig. 3 shows a schematic diagram of a SIMD architecture provided by an embodiment of the present application.

Fig. 4 shows a flowchart of a computing unit detection method provided in an embodiment of the present application.

Fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used in the description herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Further, the term "and/or" in the present application merely describes an association relationship between associated objects and means that three kinds of relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

In order to solve the problem that the existing scheme cannot detect physical damage of a computing unit during software operation, in the embodiment of the present application a computing unit (MAC) is added to each original parallel computing unit group to serve as a detection unit, and a related storage unit, arbitration unit and control unit are added. At the cost of some additional hardware on top of the original hardware structure, a computing unit with physical damage is discovered in time, and spontaneous correction and repair are ensured when a computing unit is physically damaged, so as to avoid operation errors. For ease of understanding, take the architecture of the parallel processor shown in Fig. 1 as an example. The parallel processor includes 4 mutually independent Shader Engines (SE): SE0, SE1, SE2 and SE3. Each SE includes 16 CUs (computing units): CU0, CU1, ..., CU14 and CU15. Each CU contains 4 Single Instruction Multiple Data (SIMD) architectures: SIMD0, SIMD1, SIMD2 and SIMD3. Each SIMD contains 16 MACs; each MAC performs a basic multiply-add operation and is the most basic computing unit. Each SE corresponds to a work group (Work Group), each SIMD corresponds to an instruction block (Wave), and one Wave corresponds to 64 threads (thread). Each MAC corresponds to one thread, so the whole Wave operation is completed in 4 stages (Phase).

Fig. 2 is a schematic diagram of the operation of the MACs in a typical SIMD architecture; functions such as data reading are omitted to simplify the discussion. The Wave information is input into a thread scheduling unit (thread_dispatch) at the front end for allocation. During allocation, 16 thread data in the Wave information are input in parallel into the 16 MACs for operation at one time, so the whole Wave operation is completed in 4 stages. The operation results reach MAC_OUT, which selects and outputs the MAC calculation data; a FIFO (first-in first-out) operation is then carried out and the result is finally output. In the present application, determination and repair of physical damage need to be completed, so some hardware modules need to be added. In the embodiment of the present application, a MAC unit for detection, together with an associated storage unit (Wave_fifo), arbitration unit (arbitration) and control unit (thread_control), is added to each SIMD architecture to implement the detection logic; a schematic diagram of the modified SIMD architecture is shown in Fig. 3. Of course, in an embodiment, one MAC unit may be selected from the original 16 MAC units as the detection unit without adding a MAC unit, but in that case the original running logic needs to be modified: the number of threads completed in each phase is reduced, and one Wave needs 5 phases to complete. In the present application, the number of initial computing units (MACs) is 16 as an example, but in one embodiment the number of initial computing units (MACs) may differ from 16.
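The phase counts quoted above follow from dividing the threads of a Wave by the number of MACs available for computation and rounding up. A minimal sketch of that arithmetic, with the hypothetical helper name `phases_per_wave`:

```python
import math


def phases_per_wave(threads_per_wave: int, compute_macs: int) -> int:
    """Number of phases needed when each MAC handles one thread per phase."""
    if compute_macs < 1:
        raise ValueError("at least one computing MAC is required")
    return math.ceil(threads_per_wave / compute_macs)


# 64 threads over 16 MACs need 4 phases; if one of the 16 MACs is
# reserved as the detection unit, 15 MACs need 5 phases.
normal_phases = phases_per_wave(64, 16)
reduced_phases = phases_per_wave(64, 15)
```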

The added MAC unit is used as a detection unit for detecting the 16 computing units in the original computing unit group. The added MAC unit is equivalent to the original 16 MAC units, i.e. it processes threads in the same way, and a register configures which of the 17 MAC units acts as the detection unit. When a short-running application with low accuracy requirements is run, the newly added MAC unit can be used as a computing unit alongside the original computing units to increase the computing capability of the chip. For example, when checking and repair are not needed, one Wave can be allowed to exceed 64 threads, raising the capacity to 68 threads (17 MACs × 4 phases). In this mode, Wave_fifo, arb0 to arb16 and arb_s are all disabled. When a short-running application with high accuracy requirements is run, the newly added MAC unit is used as a detection unit. When a long-running application with high accuracy requirements is run, this MAC unit likewise serves as a detection unit.

The added control unit (thread_control) is the core module of the new function. It determines which of the 17 MAC units is used as the detection unit according to the configuration of a static register; controls the arbitration units (arb0 to arb16) and the FIFO control module (arb_s); and, when it detects that a computing unit is damaged, sets the damaged computing unit to an unavailable state, traces back to the time node from which threads need to be recalculated, redistributes and recalculates all subsequent threads using the Wave information stored in the storage unit, and outputs a stall signal to prevent new Waves from being input. In addition, during detection the control unit records the polling state in real time, so that it is known which computing unit is being detected at the current moment.

The added arbitration units (arb0 to arb16) arbitrate between the computing units and the detection unit under the control of the control unit (thread_control), and feed the comparison result back to the control unit (thread_control) and the FIFO control module (arb_s) for further processing. Each arbitration unit (arb) receives the outputs of all 17 MACs at the same time, selects the output data of its corresponding MAC unit and the output data of the detection unit for comparison under the control of the control unit, and feeds the comparison result back to thread_control and arb_s. For example, if the detected unit is MAC3 and the detection unit is MAC16, arb3 selects the output data of MAC3 from the 17 MAC inputs and compares it with the output data of MAC16. It should be noted that, to improve arbitration efficiency, in the embodiment of the present application one computing unit (MAC) may correspond to one arbitration unit (arb), so that the 16 computing units (MACs) are arbitrated in parallel. Since any of the 17 MAC units may be the detection unit, 17 arbitration units are required. When MAC16 is the detection unit, MAC0 to MAC15 are computing units, arb16 is marked as inactive, and arb0 to arb15 each compare the output data of their corresponding MAC unit with the output data of the detection unit. For example, arb0 compares the output data of MAC0 with the output data of MAC16, arb1 compares the output data of MAC1 with the output data of MAC16, arb2 compares the output data of MAC2 with the output data of MAC16, and so on.
Of course, the number of arbitration units may also be smaller than 17, for example only 1. In this case, the single arbitration unit compares the output data of each computing unit with the output data of the detection unit, that is, the output data of MAC0 with that of MAC16, the output data of MAC1 with that of MAC16, the output data of MAC2 with that of MAC16, and so on. The arbitration units of the above example are therefore not to be construed as limiting the application.
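The per-MAC comparison performed by the arbitration units can be modeled in software as follows. This is a behavioral sketch only; the function name `arbitrate` is hypothetical, and the parallel hardware comparators are represented by a dictionary comprehension.

```python
from typing import Dict, Sequence


def arbitrate(mac_outputs: Sequence[int], detector: int) -> Dict[int, bool]:
    """Model arb0..arbN-1: arbiter i compares MAC i with the detection unit.

    Returns {mac_index: True if the output matches the detection unit}.
    The arbiter paired with the detection unit is inactive and is left out.
    """
    reference = mac_outputs[detector]
    return {i: out == reference
            for i, out in enumerate(mac_outputs)
            if i != detector}


# Usage: 17 MAC outputs, MAC16 is the detection unit, MAC3 deviates.
outputs = [5] * 17
outputs[3] = 6
comparison = arbitrate(outputs, detector=16)
```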

The added FIFO control module (arb_s) controls the thread storage unit (Wave_fifo) under the control of the control unit (thread_control), so that a certain amount of Wave information is stored. During normal operation, new Wave information is continuously written into Wave_fifo. Once a computing unit is found to be damaged, the time node from which threads need to be recalculated is traced back, and all subsequent threads are recalculated using the Wave information stored in Wave_fifo. Of course, in one embodiment the function of the FIFO control module (arb_s) may be incorporated into the control unit (thread_control), in which case the storage unit (Wave_fifo) is controlled directly by the control unit.

The storage unit (Wave_fifo) stores a certain amount of Wave information under the control of the FIFO control module (arb_s); the amount of stored Wave information is determined by the period required to complete one round of MAC polling detection. For example, polling 16 MACs takes 16 clocks and each clock corresponds to 16 threads, so 16 × 16 = 256 threads must be covered; since one Wave corresponds to 64 threads, 4 Waves of information need to be stored.
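The storage depth can be checked with a short calculation: 16 polling clocks at 16 threads per clock give 256 threads, which is 4 Waves of 64 threads. The helper name `waves_to_store` is hypothetical and the sketch only reproduces this arithmetic.

```python
import math


def waves_to_store(macs_polled: int, threads_per_clock: int,
                   threads_per_wave: int) -> int:
    """Waves that must be buffered to cover one full polling round."""
    # One MAC is checked per clock, so a round lasts macs_polled clocks.
    threads_in_round = macs_polled * threads_per_clock
    return math.ceil(threads_in_round / threads_per_wave)


# 16 MACs polled, 16 threads per clock, 64 threads per Wave.
depth = waves_to_store(16, 16, 64)
```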

The process of detecting the computing units in the computing unit group is described below with reference to the schematic diagram shown in fig. 3. In this case, 16 MAC units are enabled as computing units and one extra MAC unit is used as the detection unit; for example, MAC16 is used as the detection unit by default, arb16 is marked as inactive, and the maximum number of threads corresponding to one Wave is restored to 64. The input Wave information arrives at the storage unit (Wave_fifo) and the scheduling unit (thread_dispatch) at the same time. Under the control of the control unit (thread_control), thread_dispatch allocates the thread data corresponding to each instruction in the instruction block (Wave) to the corresponding computing unit in the computing unit group for operation, and at the same time inputs the thread data fed to the detected computing unit into the detection unit for operation. That is, the thread data corresponding to the same instruction is input into both the detected computing unit and the detection unit, which ensures that the two receive identical input data. It should be noted that when the number of threads in the Wave information is less than 64, some MACs may have no input data. If the detected computing unit has no input data at that time, the scheduling unit may itself generate input data for the detected computing unit and copy it to the detection unit; the data may be generated randomly or according to a specific function, so that the input of the detection unit remains the same as the input of the detected computing unit.
If the detected computing unit is MAC0, arb0, under the control of the control unit, selects the output result of MAC0 and the output result of MAC16 from the 17 inputs, detects the detected computing unit (MAC0) according to the output result of the detection unit (MAC16) and the output result of the detected computing unit to obtain a detection result, and feeds the detection result back to thread_control and arb_s. In one embodiment, whenever the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detection unit, the control unit outputs the relevant register indication signal and outputs a stall signal to prevent new Wave information from being input. In another embodiment, when the detection result indicates that the two output results are inconsistent, the control unit is further configured to suspend the operation of the threads that follow the current thread (for example, threads 16-63) and to further determine whether it is the detected computing unit or the detection unit that is abnormal.
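
The comparison performed by the arbitration unit can be modelled in software as follows. This is a simplified sketch, not the hardware described in the application: the MAC is assumed to compute `a * b + acc`, and the names `healthy_mac`, `stuck_bit_mac`, and `check_unit` as well as the stuck-at fault model are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Mac = Callable[[int, int, int], int]

def healthy_mac(a: int, b: int, acc: int) -> int:
    # Assumed MAC behaviour: multiply-accumulate.
    return a * b + acc

def stuck_bit_mac(a: int, b: int, acc: int) -> int:
    # Hypothetical fault model: output bit 0 is stuck at 1.
    return (a * b + acc) | 1

def check_unit(macs: Sequence[Mac], detector: Mac, target: int,
               operands: Tuple[int, int, int]) -> bool:
    """Feed identical operands to the detected unit and the detection unit;
    return True when their outputs agree."""
    return macs[target](*operands) == detector(*operands)

macs = [healthy_mac] * 16          # MAC0..MAC15
macs[0] = stuck_bit_mac            # inject a fault into MAC0
stall = not check_unit(macs, healthy_mac, target=0, operands=(2, 3, 4))
assert stall                       # mismatch, so new Wave input would be stalled
```

Note that a single comparison only reveals a fault if the chosen operands actually exercise it; with this fault model, any input whose correct result is odd would still pass.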

When determining whether the detected computing unit or the detection unit is abnormal, in one embodiment, the control unit controls the scheduling unit to input the thread data that produced the inconsistent detection result into all of the MAC units, that is, into each computing unit in the computing unit group and into the detection unit, for operation. Each arbitration unit (arb0 to arb15) then compares the output data of its corresponding computing unit with the output data of the detection unit and feeds the detection result back to thread_control and arb_s. The control unit is further configured to determine whether the detected computing unit or the detection unit is abnormal according to the output result of each computing unit and the output result of the detection unit: when the output results of more than half of the computing units are consistent with the output result of the detection unit, the detected computing unit is considered abnormal; when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, the detection unit is considered abnormal, the relevant register indication signal is output, and the subsequent computing units are no longer detected.
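
The majority rule described above can be sketched as a small function. The name `diagnose`, the string labels, and the handling of an exact tie as undetermined are assumptions for illustration; the application only defines the two "more than half" cases.

```python
from typing import Sequence

def diagnose(unit_outputs: Sequence[int], detector_output: int) -> str:
    """Decide which unit is abnormal from the outputs of all computing units
    and the output of the detection unit, all computed on the same input."""
    total = len(unit_outputs)
    agree = sum(1 for out in unit_outputs if out == detector_output)
    disagree = total - agree
    if agree * 2 > total:
        # Most computing units match the detection unit, so the detection
        # unit is trusted and the detected computing unit is at fault.
        return "detected_unit"
    if disagree * 2 > total:
        # Most computing units contradict the detection unit.
        return "detection_unit"
    return "undetermined"  # exact tie: not covered by the text

# MAC0 returns 11 while the other 15 units and the detection unit return 10.
assert diagnose([11] + [10] * 15, 10) == "detected_unit"
```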

In another embodiment, when determining whether the detected computing unit or the detection unit is abnormal, the control unit controls the scheduling unit to input the thread data that produced the inconsistent detection result into each computing unit in the computing unit group and into the detection unit for operation. When the output results of more than half of the computing units are consistent with the output result of the detection unit, the control unit further controls the scheduling unit to input the thread data into each computing unit and the detection unit again for a second operation, and determines whether the detected computing unit or the detection unit is abnormal according to the output results of this second operation. If, in the second operation, the output results of more than half of the computing units are still consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal. Repeating the operation reduces the influence of software flow abnormalities on the judgment result.
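
The two-pass confirmation can be sketched as below. The callable `run_all`, which stands in for "input the thread data into every computing unit and the detection unit and collect the outputs", is hypothetical, as are the function names.

```python
from typing import Callable, Sequence, Tuple

RunAll = Callable[[], Tuple[Sequence[int], int]]  # -> (unit outputs, detector output)

def majority_agrees(unit_outputs: Sequence[int], detector_output: int) -> bool:
    agree = sum(1 for out in unit_outputs if out == detector_output)
    return agree * 2 > len(unit_outputs)

def detected_unit_confirmed_faulty(run_all: RunAll, runs: int = 2) -> bool:
    """Return True only if the majority agrees with the detection unit
    in every one of `runs` consecutive operations on the same thread data."""
    for _ in range(runs):
        unit_outputs, detector_output = run_all()
        if not majority_agrees(unit_outputs, detector_output):
            return False
    return True

# A persistent fault in one unit is confirmed on both passes.
assert detected_unit_confirmed_faulty(lambda: ([11] + [10] * 15, 10))
```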

Likewise, when determining that the detection unit is abnormal, in one embodiment, the detection unit is determined to be abnormal as soon as the output results of more than half of the computing units are inconsistent with the output result of the detection unit. In another embodiment, when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, the control unit updates the number of times such inconsistency has occurred (for example, increments a counter by 1) and then judges whether the updated count has reached a preset threshold; the detection unit is determined to be abnormal when the updated count equals the preset threshold (for example, 3). The count is cleared as soon as the output results of more than half of the computing units are consistent with the output result of the detection unit. The preset threshold may be set as needed, for example to 3, 4, or 5. Similarly, the "more than half" criterion may be set as needed, and the larger the required number of units, the more accurate the result; for example, it may be set to 14, but it is not limited to 14 and may be any value from 9 to 15.
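
The counting scheme can be sketched as a small state holder. The class name `DetectorFaultCounter` and its interface are hypothetical; the default threshold of 3 and the clear-on-agreement behaviour follow the example in the text, and a tie is assumed to leave the count unchanged.

```python
class DetectorFaultCounter:
    """Counts rounds in which most computing units contradict the detection unit."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.count = 0

    def update(self, agree: int, total: int) -> bool:
        """Record one full-group comparison; return True when the
        detection unit should be declared abnormal."""
        disagree = total - agree
        if disagree * 2 > total:
            self.count += 1          # majority inconsistent with the detector
        elif agree * 2 > total:
            self.count = 0           # majority consistent: clear the count
        return self.count >= self.threshold

counter = DetectorFaultCounter(threshold=3)
assert not counter.update(agree=0, total=16)
assert not counter.update(agree=0, total=16)
assert counter.update(agree=0, total=16)   # third mismatching round in a row
```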

After determining that the detected computing unit is abnormal, the control unit is further configured to set the detected computing unit (e.g., arb0) to an unavailable state, trace back to the time node from which threads need to be recalculated, indirectly control Wave_fifo by controlling arb_s, and, using the Wave information stored in the storage unit, reallocate the threads corresponding to all instructions that need to be recalculated after that time node to the computing units in the computing unit group other than the detected computing unit for operation. In another embodiment, after determining that the detected computing unit is abnormal, the control unit sets the detected computing unit to an unavailable state, sets the detection unit as a computing unit to replace the detected computing unit, and traces back to the time node from which threads need to be recalculated; based on the instruction block stored in the storage unit, it then reallocates the threads corresponding to all instructions that need to be recalculated after that time node to the computing units in the computing unit group other than the detected computing unit for operation. In this case, the SIMD no longer has a self-check function. Once the Waves that need to be recalculated have been recalculated and output, the stall state is released, new Wave information is allowed to be input, and the system continues computing.
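
The reallocation of threads to the remaining computing units can be sketched as follows. The application does not specify an allocation policy; the clock-by-clock round-robin used here, and the name `redistribute`, are assumptions for illustration.

```python
from typing import Dict, List, Sequence

def redistribute(thread_ids: Sequence[int], macs: Sequence[int],
                 faulty: int) -> List[Dict[int, int]]:
    """Assign threads to every MAC except the faulty one.

    Returns one dict per clock, mapping MAC index -> thread id."""
    usable = [m for m in macs if m != faulty]
    if not usable:
        raise ValueError("no usable computing unit left")
    schedule = []
    for start in range(0, len(thread_ids), len(usable)):
        chunk = thread_ids[start:start + len(usable)]
        schedule.append(dict(zip(usable, chunk)))
    return schedule

# 16 threads on 15 usable MACs need two clocks instead of one.
plan = redistribute(list(range(16)), list(range(16)), faulty=0)
assert len(plan) == 2 and 0 not in plan[0]
```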

After the detected computing unit is determined to be abnormal, it is necessary to trace back to the time node from which threads need to be recalculated. If the detected computing unit is being checked for the first time, that time node is the starting point of the thread operation. For example, in the first polling round over the computing units, one round takes 16 clocks because there are 16 computing units: clk0 detects MAC0, clk1 detects MAC1, clk2 detects MAC2, and so on. If MAC15 is found to be abnormal when clk15 detects MAC15, the time node to trace back to is clk0. Similarly, if MAC5 is found to be abnormal when clk5 detects MAC5, the time node to trace back to is also clk0. If the detected computing unit is not being checked for the first time, the time node to trace back to is the time node at which the detected computing unit was last checked.
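
The backtracking rule can be expressed as a lookup of the last clock at which each unit was checked. The dictionary `last_checked_at` and the function `rollback_point` are hypothetical names used only for this sketch.

```python
from typing import Dict

def rollback_point(last_checked_at: Dict[int, int], mac: int,
                   start_clk: int = 0) -> int:
    """Clock to trace back to when `mac` is found abnormal.

    A unit never checked before rolls back to the start of the thread
    operation; otherwise it rolls back to its previous check."""
    return last_checked_at.get(mac, start_clk)

last_checked_at: Dict[int, int] = {}
# First round: MAC15 fails at clk15 and has no earlier check, so roll back to clk0.
assert rollback_point(last_checked_at, mac=15) == 0
# MAC5 was checked at clk5; if it fails in a later round, roll back to clk5.
last_checked_at[5] = 5
assert rollback_point(last_checked_at, mac=5) == 5
```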

When detecting the computing units in the computing unit group, the detecting unit may detect each computing unit by polling, or may detect a specific computing unit according to the configuration of the register.

After the functions of the modules shown in fig. 3 in the process of detecting the computing units in the computing unit group are introduced, a computing unit detection method applied to a parallel processor provided in the embodiment of the present application will be described below with reference to fig. 4.

Step S101: when detecting the computing units in the computing unit group, respectively inputting the thread data corresponding to the same instruction into the detected computing unit and the detecting unit for operation.

Assuming that the detection unit is MAC16 and that the computing unit to be detected in the computing unit group is MAC5, the thread data corresponding to the same instruction is input into both the detected computing unit (in this case, MAC5) and the detection unit (in this case, MAC16) for operation, so as to ensure that the two receive the same input data. The detected computing unit is a computing unit in the computing unit group.

In one embodiment, when the computing units in the computing unit group are detected, each computing unit in the group is polled in a preset order. For example, the polling detection may be performed in the order MAC0 to MAC15, or in another order, for example MAC5 to MAC15 followed by MAC0 to MAC4. Alternatively, a designated computing unit may be detected according to the configuration of a register. During detection, the polling status is recorded in real time so that it is known which computing unit is being detected at the current moment.
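
A preset polling order with an arbitrary starting unit can be sketched as below. The sketch assumes that the 16 computing units MAC0 to MAC15 are the ones polled and that one unit is checked per clock; the function names are hypothetical.

```python
from typing import List

def polling_order(num_units: int = 16, start: int = 0) -> List[int]:
    """One full polling round, beginning at `start` and wrapping around."""
    return [(start + i) % num_units for i in range(num_units)]

def unit_at_clock(clk: int, num_units: int = 16, start: int = 0) -> int:
    """Which unit is under test at a given clock, one unit per clock."""
    return (start + clk) % num_units

assert polling_order()[:3] == [0, 1, 2]
assert polling_order(start=5)[0] == 5 and polling_order(start=5)[-1] == 4
assert unit_at_clock(21) == 5   # second round, sixth clock
```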

Step S102: and detecting the detected computing unit according to the output result of the detecting unit and the output result of the detected computing unit to obtain a detection result.

After the thread data corresponding to the same instruction has been input into both the detected computing unit and the detection unit for operation, the detected computing unit is detected according to the output results of the detection unit and of the detected computing unit to obtain a detection result. Whether the detected computing unit is abnormal can then be judged from the detection result. For example, when the detection result indicates that the two output results are inconsistent, the detected computing unit is determined to be abnormal by default, and the operation of the threads that follow the current thread is suspended.

In another embodiment, when the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detection unit, the method further includes: suspending input of the threads corresponding to instructions that follow the time of the current instruction, and further determining whether the detected computing unit or the detection unit is abnormal. In this embodiment, when the detection result indicates that the two output results are inconsistent, the detected computing unit is no longer assumed to be abnormal by default; instead, it is further determined whether it is the detected computing unit or the detection unit that is abnormal.

When determining whether the detected computing unit or the detection unit is abnormal, in one embodiment, the thread data that produced the inconsistent detection result is input into each computing unit in the computing unit group and into the detection unit for operation, and whether the detected computing unit or the detection unit is abnormal is determined according to the output result of each computing unit and the output result of the detection unit. That is, in this embodiment, the thread data corresponding to the detected computing unit (MAC5) is input into each computing unit (MAC0 to MAC15) and the detection unit (MAC16) for operation, and whether the MAC5 unit or the MAC16 unit is damaged is determined according to the output results of the 17 MAC units. For example, when the output results of more than half of the computing units are consistent with the output result of the detection unit (MAC16), the detected computing unit (MAC5) is considered abnormal; when the output results of more than half of the computing units are inconsistent with the output result of the detection unit (MAC16), the detection unit (MAC16) is considered abnormal.

In another embodiment, when determining whether the detected computing unit or the detection unit is abnormal, the thread data that produced the inconsistent detection result is input into each computing unit in the computing unit group and into the detection unit for operation. When the output results of more than half of the computing units are consistent with the output result of the detection unit, the thread data is input into each computing unit and the detection unit again for a second operation, and whether the detected computing unit or the detection unit is abnormal is determined according to the output results of the second operation. For example, if in the second operation the output results of more than half of the computing units are consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal. In this embodiment, the detected computing unit is determined to be abnormal only when the output results of more than half of the computing units match the output result of the detection unit in two consecutive operations.

In one embodiment, when determining whether the detected computing unit or the detection unit is abnormal, the detection unit is determined to be abnormal when the output results of more than half of the computing units are inconsistent with the output result of the detection unit. In another embodiment, when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, the number of times such inconsistency has occurred is updated, and this count is cleared as soon as the output results of more than half of the computing units are consistent with the output result of the detection unit; when the updated count equals a preset threshold, the detection unit is determined to be abnormal. In this embodiment, an inconsistency between the output results of more than half of the computing units and the output result of the detection unit does not immediately lead to the conclusion that the detection unit is abnormal. Instead, the occurrences are counted, and the detection unit is determined to be abnormal only when the accumulated count reaches the preset threshold (for example, 5); during this period, the count is cleared whenever the output results of more than half of the computing units are consistent with the output result of the detection unit.

Optionally, after determining that the detected computing unit is abnormal, the method further comprises: setting the detected computing unit to an unavailable state and tracing back to the time node from which threads need to be recalculated; and reallocating the threads corresponding to all instructions that need to be recalculated after that time node to the computing units in the computing unit group other than the detected computing unit for operation. That is, MAC5 is set to an unavailable state, the process traces back to the time node at which MAC5 was last checked, and the threads corresponding to all instructions that need to be recalculated after that time node are reallocated to MAC0 to MAC4 and MAC6 to MAC15 for operation. In order not to change the original operation logic, in one implementation, after the detected computing unit is determined to be abnormal, the detection unit is used as a computing unit by modifying the configuration of the relevant register, thereby replacing the damaged MAC unit. In this implementation, the detected computing unit is set to an unavailable state, the detection unit is set as a computing unit to replace the detected computing unit, and the process traces back to the time node from which threads need to be recalculated; the threads corresponding to all instructions that need to be recalculated after that time node are then reallocated to the detection unit and to the computing units in the computing unit group other than the detected computing unit for operation.
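
Replacing the damaged MAC with the detection unit can be pictured as rewriting a lane-to-MAC mapping. The mapping table below is a hypothetical stand-in for the "relevant register" configuration, whose actual layout is not described in the application.

```python
from typing import Dict

def remap_with_spare(lane_map: Dict[int, int], faulty: int,
                     spare: int) -> Dict[int, int]:
    """Return a new lane -> physical MAC mapping in which every lane that
    used the faulty MAC now uses the spare (former detection unit)."""
    return {lane: (spare if mac == faulty else mac)
            for lane, mac in lane_map.items()}

lane_map = {lane: lane for lane in range(16)}      # lanes 0..15 on MAC0..MAC15
new_map = remap_with_spare(lane_map, faulty=5, spare=16)
assert new_map[5] == 16 and 5 not in new_map.values()
```

Because all 16 lanes stay populated, the original scheduling logic is unchanged, at the cost of no longer having a spare unit available for detection.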

The computing unit detection method in the embodiment of the present application may be performed at chip power-on or in an idle stage of program operation, or during the normal operation of software. For example, when 16 threads in one piece of Wave information are input into the 16 computing units MAC0 to MAC15 for operation, and MAC0 is the unit being detected, the input data corresponding to MAC0 also needs to be input into the detection unit (MAC16). If the output result of MAC0 does not match the output result of MAC16, the input data corresponding to MAC0 needs to be input into each of MAC0 to MAC16 for calculation.

The method provided by the embodiment of the present application has the same implementation principle and produces the same technical effect as the foregoing device embodiment. For brevity, for anything not mentioned in the method embodiment, reference may be made to the corresponding content in the foregoing device embodiment.

As shown in fig. 5, fig. 5 is a block diagram illustrating a structure of an electronic device 100 according to an embodiment of the present disclosure. The electronic device 100 includes: a transceiver 110, a memory 120, a communication bus 130, and a parallel processor 140.

The transceiver 110, the memory 120, and the parallel processor 140 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, these components may be electrically coupled to each other via one or more communication buses 130 or signal lines. The transceiver 110 is used for transmitting and receiving data. The memory 120 is used for storing a computer program including at least one software functional module, which can be stored in the memory 120 in the form of software or firmware, or solidified in an Operating System (OS) of the electronic device 100. The parallel processor 140 is used to execute the executable software functional modules or computer programs stored in the memory 120. For example, the parallel processor 140 is configured to, when detecting a computing unit in the computing unit group, input thread data corresponding to the same instruction into a detected computing unit and the detection unit respectively for operation; and to detect the detected computing unit according to the output result of the detection unit and the output result of the detected computing unit to obtain a detection result.

The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The parallel processor 140 may be an integrated circuit chip having signal processing capabilities. The parallel processor may be a graphics processing unit (GPU) or another AI parallel processor. It may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application.

The electronic device 100 includes, but is not limited to, a smart phone, a tablet, a computer, a server, and the like.

The present embodiment also provides a non-volatile computer-readable storage medium (hereinafter, referred to as a storage medium), where the storage medium stores a computer program, and the computer program is executed by a computer such as the electronic device 100 described above to perform the above-described computing unit detection method.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing an electronic device (which may be a personal computer, a notebook computer, a server, or an electronic device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A computing unit detection method, applied to a parallel processor comprising a computing unit group and a detection unit, the method comprising:

respectively inputting thread data corresponding to the same instruction into a detected computing unit and a detection unit for operation; wherein the detected computing unit is a computing unit in the computing unit group;

and detecting the detected computing unit according to the output result of the detecting unit and the output result of the detected computing unit to obtain a detection result.

2. The method of claim 1, wherein when the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detection unit, the method further comprises:

and pausing input of a thread corresponding to the instruction after the moment of the instruction, and determining whether the detected computing unit or the detecting unit is abnormal.

3. The method of claim 2, wherein determining whether the detected computing unit or the detection unit is abnormal comprises:

respectively inputting the corresponding thread data into each computing unit in the computing unit group and the detection unit to carry out operation when the detection results are inconsistent;

and determining whether the detected computing unit or the detection unit is abnormal according to the output result of each computing unit and the output result of the detection unit, wherein when the output results of more than half of the computing units are consistent with the output result of the detection unit, the detected computing unit is determined to be abnormal.

4. The method of claim 2, wherein determining whether the detected computing unit or the detection unit is abnormal comprises:

when the output results of more than half of the computing units are consistent with the output result of the detection unit, respectively inputting the thread data into each computing unit in the computing unit group and the detection unit again for operation;

and determining whether the detected computing unit or the detecting unit is abnormal according to the output result of each computing unit which is operated again and the output result of the detecting unit which is operated again, wherein if the output result of more than half of the computing units which are operated again is consistent with the output result of the detecting unit, the detected computing unit is determined to be abnormal.

5. The method of claim 3 or 4, wherein determining whether the detected computing unit or the detection unit is abnormal further comprises:

when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, updating the number of times the output results have been inconsistent, wherein the number of times is cleared if the output results of more than half of the computing units are consistent with the output result of the detection unit;

and when the updated times are equal to a preset threshold value, determining that the detection unit is abnormal.

6. The method of claim 2, wherein after determining that the detected computing unit is abnormal, the method further comprises:

setting the detected computing unit to be in an unavailable state, and backtracking to a time node of a thread needing to be recalculated;

and redistributing the threads corresponding to all the instructions which need to be recalculated after the time node to other calculating units in the calculating unit group except the detected calculating unit for operation.

7. The method of claim 2, wherein after determining that the detected computing unit is abnormal, the method further comprises:

setting the detected computing unit to be in an unavailable state, setting the detecting unit to be a computing unit to replace the detected computing unit, and tracing back to a time node of a thread to be recalculated;

and redistributing the threads corresponding to all the instructions which need to be recalculated after the time node to other calculating units except the detected calculating unit in the calculating unit group and the detecting unit for operation.

8. The method according to claim 1, wherein, when detecting the computing units in the computing unit group, polling detection is performed on each computing unit in the computing unit group according to a preset order.

9. A parallel processor comprising a SIMD architecture, the SIMD architecture comprising:

a computing unit group; a detection unit; a control unit;

the scheduling unit is used for respectively inputting the thread data corresponding to the same instruction into a detected computing unit and the detection unit to carry out operation under the control of the control unit, wherein the detected computing unit is a computing unit in the computing unit group;

and the arbitration unit is used for detecting the detected computing unit according to the output result of the detection unit and the output result of the detected computing unit under the control of the control unit to obtain a detection result.

10. The parallel processor according to claim 9, wherein the control unit is further configured to suspend input of a thread corresponding to an instruction after the time when the instruction is located, and determine whether the detected computing unit or the detecting unit is abnormal, when the detection result indicates that the output result of the detected computing unit is inconsistent with the output result of the detecting unit.

11. The parallel processor according to claim 10, wherein the control unit is further configured to control the scheduling unit to input corresponding thread data to each of the computing units in the computing unit group and the detecting unit for operation when the detection results are inconsistent; and further for determining whether the detected computing unit or the detecting unit is abnormal, based on an output result of each computing unit and an output result of the detecting unit, wherein the detected computing unit is determined to be abnormal when more than half of the output results of the computing units are identical to the output result of the detecting unit.

12. The parallel processor according to claim 10, wherein the control unit is further configured to control the scheduling unit to input corresponding thread data to each of the computing units in the computing unit group and the detecting unit for operation when the detection results are inconsistent; when the output results of more than half of the computing units are consistent with the output result of the detection unit, to control the scheduling unit to input the thread data into each computing unit in the computing unit group and the detection unit again for operation; and to determine whether the detected computing unit or the detecting unit is abnormal according to the output result of each computing unit which is operated again and the output result of the detecting unit which is operated again, wherein if the output results of more than half of the computing units which are operated again are consistent with the output result of the detecting unit, the detected computing unit is determined to be abnormal.

13. The parallel processor according to claim 11 or 12, wherein the control unit is further configured to update the number of times the output results have been inconsistent when the output results of more than half of the computing units are inconsistent with the output result of the detection unit, wherein the number of times is cleared if the output results of more than half of the computing units are consistent with the output result of the detection unit; and further configured to determine that the detection unit is abnormal when the updated number of times is equal to a preset threshold.

14. The parallel processor of claim 10, wherein the SIMD architecture further comprises: a storage unit for storing an input instruction block;

the control unit is also used for setting the detected computing unit to be in an unavailable state after determining that the detected computing unit is abnormal, and backtracking to a time node needing to recalculate the thread; and reallocating threads corresponding to all the instructions which need to be recalculated after the time node to other computing units in the computing unit group except the detected computing unit for operation based on the instruction block stored in the storage unit.

15. The parallel processor of claim 10, wherein the SIMD architecture further comprises: a storage unit for storing an input instruction block;

the control unit is also used for setting the detected computing unit to be in an unavailable state after determining that the detected computing unit is abnormal, setting the detecting unit to be the computing unit to replace the detected computing unit, and tracing back to a time node needing to recalculate the thread; and reallocating threads corresponding to all the instructions which need to be recalculated after the time node to other computing units in the computing unit group except the detected computing unit for operation based on the instruction block stored in the storage unit.

16. An electronic device, comprising: a parallel processor as claimed in any one of claims 9 to 15.