CN112084071A - Calculation unit operation reinforcement method, parallel processor and electronic equipment - Google Patents


Info

Publication number
CN112084071A
CN112084071A (application CN202010963761.2A)
Authority
CN
China
Prior art keywords: unit, computing, redundant, output result, redundancy
Prior art date
Legal status: Granted
Application number
CN202010963761.2A
Other languages
Chinese (zh)
Other versions
CN112084071B (en)
Inventor
袁庆
陈庆
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202010963761.2A
Publication of CN112084071A
Application granted
Publication of CN112084071B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/1629 Error detection by comparing the output of redundant processing systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application relates to a computing unit operation reinforcement method, a parallel processor and an electronic device, and belongs to the technical field of computers. The method is applied to a parallel processor that comprises a computing unit group and a redundant unit group, the redundant unit group comprising a first redundant unit and a second redundant unit. The method comprises: inputting data corresponding to the same thread into a computing unit to be reinforced and into the first redundant unit and the second redundant unit of the redundant unit group corresponding to that computing unit, respectively, for operation, where the computing unit to be reinforced is a computing unit in the computing unit group; judging whether at least two identical output results exist among the output result of the first redundant unit, the output result of the second redundant unit and the output result of the computing unit to be reinforced; and, when at least two identical output results exist, outputting that identical result. The method reinforces the computing unit to be reinforced through redundant computation and solves the problem of the poor reinforcement effect of existing methods.

Description

Calculation unit operation reinforcement method, parallel processor and electronic equipment
Technical Field
The application belongs to the technical field of computers, and particularly relates to a computing unit operation reinforcement method, a parallel processor and electronic equipment.
Background
At present, driven by the demands of large-scale parallel computing and the development of Artificial Intelligence (AI), parallel processors (such as Graphics Processing Units (GPUs)) are widely applied in fields such as artificial intelligence training, large-scale scientific computing, aerospace, and autonomous driving.
For general use environments, no special reinforcement design is required for the computing units in a parallel processor, but for some special use environments such a design is necessary to reduce the loss caused by computing errors. For autonomous driving, a certain reinforcement design must be applied to the computing units in the parallel processor to prevent traffic accidents caused by computing errors; for aerospace, registers may be bombarded by high-energy particles, causing the values stored in them to flip unexpectedly, so the computing units in the parallel processor must likewise be reinforced. In addition, compared with a conventional Central Processing Unit (CPU), a parallel processor (e.g., a GPU) contains many more transistors and its arithmetic units operate in parallel, so errors caused by sudden register bit flips are more likely to occur.
Currently, commonly used error-correction and reinforcement methods include Error Correction Code (ECC) and checkpoint techniques. The checkpoint technique must recompute after an error is detected, which costs time, and whether an error is detected at all is a matter of probability. ECC checking is implemented by adding extra check bits to the original data bits and is mainly used for memory circuits; for arithmetic circuits, because there are many parallel computing units, considerably more logic must be added, which causes extra time consumption.
Disclosure of Invention
In view of the above, an object of the present application is to provide a computing unit operation reinforcing method, a parallel processor, and an electronic device, so as to solve the problems of long detection time, complex processing logic, and low accuracy of the conventional error correction and reinforcing method.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a computing unit operation reinforcing method, which is applied to a parallel processor, where the parallel processor includes a computing unit group and a redundant unit group, and the redundant unit group includes: a first redundancy unit and a second redundancy unit, the method comprising: respectively inputting data corresponding to the same thread into a computing unit to be reinforced and a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing unit to be reinforced for operation; wherein the computing unit to be reinforced is a computing unit in the computing unit group; judging whether at least two identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit and the output result of the to-be-reinforced calculation unit; outputting the same output result when there are at least two same output results. In the embodiment of the application, the redundant unit group is used for performing redundant calculation on the thread data input into the calculation unit to be reinforced, then the calculation unit to be reinforced is reinforced through the redundant calculation, and the same result is output, so that the problems of long detection time, complex processing logic and poor reinforcing effect existing in the conventional error correction and reinforcement method are solved.
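For illustration only, the decision rule described in the first aspect can be sketched as follows (a minimal sketch in Python; the names vote_2_of_3 and run_hardened are illustrative and are not taken from this application):

```python
# Minimal sketch of the 2-out-of-3 decision described above.
# Names (vote_2_of_3, run_hardened) are illustrative, not from the patent.

def vote_2_of_3(r_main, r_red1, r_red2):
    """Return the value on which at least two of the three results agree,
    or None if all three differ (arbitration failure)."""
    if r_main == r_red1 or r_main == r_red2:
        return r_main
    if r_red1 == r_red2:
        return r_red1
    return None  # all three differ: retry or report an error

def run_hardened(thread_data, compute_unit, redundant_unit_1, redundant_unit_2):
    # The same thread data is fed to the unit being reinforced and to both
    # redundant units; the agreed (majority) result is taken as the output.
    results = (compute_unit(thread_data),
               redundant_unit_1(thread_data),
               redundant_unit_2(thread_data))
    return vote_2_of_3(*results)
```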
With reference to a possible implementation manner of the embodiment of the first aspect, before the data corresponding to the same thread is input into the to-be-hardened computation unit and the first redundant unit and the second redundant unit in the redundant unit group corresponding to the to-be-hardened computation unit respectively for operation, the method further includes: according to the operation reinforcement level obtained from the register, determining the number of computing units corresponding to the operation reinforcement level from all computing units of the parallel processor to form the computing unit group; correspondingly, inputting data corresponding to the same thread into a computing unit to be consolidated and a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing unit to be consolidated respectively for operation, comprising: selecting threads with the same number as the number of the computing units in the computing unit group, inputting data corresponding to the selected threads into the computing units in the computing unit group for operation, and respectively inputting data corresponding to the threads input into the computing units to be reinforced into a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing units to be reinforced for operation, wherein one computing unit corresponds to one thread. In the embodiment of the application, the number of the computing units in the computing unit group is determined through the operation reinforcement level obtained from the register, the thread size is adjusted in a self-adaptive mode, and the threads with the corresponding number are selected to be distributed, so that the method can be suitable for different reinforcement requirements.
With reference to one possible implementation manner of the embodiment of the first aspect, the method further includes: and caching the output results of the other computing units except the computing unit to be reinforced in the computing unit group and the same output result, and outputting the output results of the preset number according to a first-in first-out sequence when the cached output results reach the preset number. In the embodiment of the application, the results output by the computing unit are cached, when the cached output results reach the preset number, the output results of the preset number are output according to the first-in first-out sequence, so that under the condition that the computing unit is not additionally added, the redundancy design can be realized by utilizing the original computing resources, and then the redundancy design is utilized for reinforcement.
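As a rough sketch of this first-in first-out recombination (the 16-result output group follows the example figures given later in the description; the class and method names are illustrative):

```python
from collections import deque

class ResultReorderBuffer:
    """Buffers per-thread output results and releases them in arrival order
    once a full output group (16 results in the later examples) is available."""
    def __init__(self, group_size=16):
        self.group_size = group_size
        self.fifo = deque()

    def push(self, result):
        # Results arrive in the order the threads were allocated.
        self.fifo.append(result)

    def pop_group(self):
        # Emit only when a complete group has accumulated, preserving FIFO order.
        if len(self.fifo) < self.group_size:
            return None
        return [self.fifo.popleft() for _ in range(self.group_size)]
```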
With reference to a possible implementation manner of the embodiment of the first aspect, after determining whether at least two identical output results exist among the output result of the first redundant unit, the output result of the second redundant unit, and the output result of the computing unit to be reinforced, the method further includes: when at least two identical output results do not exist, suspending the input of threads that follow the time slot of the same thread, and once again inputting the data corresponding to the same thread into the computing unit to be reinforced and into the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced, respectively, for operation; judging whether at least two identical latest output results exist among the latest output result of the first redundant unit, the latest output result of the second redundant unit and the latest output result of the computing unit to be reinforced; and outputting the identical latest output result when at least two identical latest output results exist. In the embodiment of the application, when at least two identical output results do not exist, the calculation is repeated with the same input while subsequent thread input is suspended, and the output results of the computing unit to be reinforced and of the corresponding first and second redundant units are judged again, so as to eliminate errors caused by abnormal operation.
With reference to a possible implementation manner of the embodiment of the first aspect, after determining whether there are at least two identical latest output results in the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the to-be-consolidated calculation unit, the method further includes: and if the latest output result of the first redundancy unit, the latest output result of the second redundancy unit and the latest output result of the to-be-reinforced computing unit are different, an error is reported. In the embodiment of the application, if the 3 output results of the two calculations are still different, an error is reported, so that related personnel can timely know the situation.
With reference to a possible implementation manner of the embodiment of the first aspect, the number of the redundant unit groups is smaller than the number of the computing units in the computing unit group, and the computing units in the computing unit group are polled and consolidated according to a preset order by using the redundant unit groups. In the embodiment of the application, when the number of the redundant unit groups is smaller than the number of the computing units in the computing unit groups, the computing units in the computing unit groups are subjected to polling reinforcement by using the redundant unit groups according to the preset sequence, so that each computing unit in the computing unit groups can be reinforced within a certain time, and the reinforcing effect can be further improved.
With reference to one possible implementation manner of the embodiment of the first aspect, the number of the redundancy unit groups is equal to the number of the computing units in the computing unit group, and the parallel processor further includes: polling redundant units, the method further comprising: inputting the data corresponding to the same thread into the polling redundancy unit for operation; correspondingly, judging whether at least two identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit and the output result of the to-be-consolidated calculation unit comprises the following steps: and judging whether at least three same output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit, the output result of the polling redundancy unit and the output result of the to-be-consolidated calculation unit or not, or two same output results exist and the other two output results are different, wherein when at least three same output results exist or two same output results exist and the other two output results are different, the same output result is output. In the embodiment of the application, when the parallel processor further includes a polling redundancy unit, and when the computing unit to be reinforced is reinforced, input data input into the computing unit to be reinforced is input into the polling redundancy unit for computing, and 4 output results including the polling redundancy unit are synthesized to determine a final output result, so as to enhance the reinforcement effect and realize the full utilization of computing resources (because when the number of the redundancy unit groups is the same as the number of the computing units in the computing unit groups, there will usually be redundant computing units, and by taking the extra computing units as the polling redundancy units, the computing units to be reinforced are reinforced together with the redundancy unit groups, so as to realize the full utilization of computing resources).
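For illustration only, the four-result decision just described can be sketched as follows (assuming the results can be compared for equality; the function name is illustrative):

```python
from collections import Counter

def vote_4_way(results):
    """Decision rule sketched from the description above, applied to the four
    results (computing unit, first and second redundant units, polling
    redundant unit). Returns the winning value, or None on arbitration failure."""
    counts = Counter(results).most_common()
    top_value, top_count = counts[0]
    if top_count >= 3:
        return top_value          # at least three identical results
    if top_count == 2 and len(counts) == 3:
        return top_value          # two identical results, the other two mutually different
    return None                   # all four differ, or two conflicting pairs of results
```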
With reference to a possible implementation manner of the embodiment of the first aspect, the polling redundancy unit is used to perform polling hardening on the computing units in the computing unit group according to a preset order. In the embodiment of the application, when the number of the redundant unit groups is equal to the number of the computing units in the computing unit groups, the polling redundant units are used for polling and reinforcing the computing units in the computing unit groups according to the preset sequence, so that each computing unit in the computing unit groups can be reinforced by the polling redundant unit within a certain time, and the reinforcing effect can be further improved.
In a second aspect, embodiments of the present application further provide a parallel processor including a Single Instruction Multiple Data (SIMD) architecture, the SIMD architecture comprising: a plurality of computing units, a scheduling unit, a control unit and an arbitration unit. A part of the plurality of computing units are used as redundant units to form at least one redundant unit group, and the remaining computing units form the computing unit group, each redundant unit group comprising a first redundant unit and a second redundant unit. The scheduling unit is used for respectively inputting, under the control of the control unit, data corresponding to the same thread into a computing unit to be reinforced and into the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced to carry out the operation; the computing unit to be reinforced is a computing unit in the computing unit group. The arbitration unit is used for judging, under the control of the control unit, whether at least two identical output results exist among the output result of the first redundant unit, the output result of the second redundant unit and the output result of the computing unit to be reinforced, and for outputting the identical output result when at least two identical output results exist.
With reference to a possible implementation manner of the embodiment of the second aspect, the parallel processor is connected to the register, and the control unit is further configured to, before controlling the scheduling unit to input the data corresponding to the same thread into the to-be-consolidated calculation unit and the first redundancy unit and the second redundancy unit in the redundancy unit group corresponding to the to-be-consolidated calculation unit, respectively, to perform an operation, determine, from the plurality of calculation units, a number of calculation units corresponding to the operation consolidation level according to an operation consolidation level obtained from the register, so as to form the calculation unit group, where the number of calculation units corresponding to different operation consolidation levels is different; correspondingly, the scheduling unit is configured to select, under the control of the control unit, threads of which the number is the same as that of the computing units in the computing unit group, input data corresponding to the selected threads to the computing units in the computing unit group for operation, and input data corresponding to the threads input to the computing unit to be consolidated to the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be consolidated respectively for operation, where one computing unit corresponds to one thread.
In combination with one possible implementation manner of the embodiment of the second aspect, the SIMD architecture further includes: and the data caching unit is used for caching the output results of the other computing units except the computing unit to be reinforced in the computing unit group and the same output result under the control of the control unit, and outputting the output results of the preset number according to a first-in first-out sequence when the output results to be cached reach the preset number.
With reference to a possible implementation manner of the second aspect, the arbitration unit is further configured to feed back the arbitration result to the control unit, and the control unit is further configured to, when at least two identical output results do not exist, suspend the input of threads that follow the time slot of the same thread and control the scheduling unit to once again input the data corresponding to the same thread to the computing unit to be reinforced and to the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced for operation; the arbitration unit is further configured to determine whether at least two identical latest output results exist among the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the computing unit to be reinforced, and to output the identical latest output result when at least two identical latest output results exist.
With reference to a possible implementation manner of the embodiment of the second aspect, the control unit is further configured to report an error if the latest output result of the first redundancy unit, the latest output result of the second redundancy unit, and the latest output result of the to-be-consolidated calculation unit are all different.
With reference to a possible implementation manner of the embodiment of the second aspect, the number of the redundant unit groups is smaller than the number of the computing units in the computing unit groups, and the control unit is further configured to perform polling hardening on the computing units in the computing unit groups according to a preset order by using the redundant unit groups.
With reference to one possible implementation manner of the embodiment of the second aspect, when the number of the redundancy unit groups is equal to the number of the computing units in the computing unit group, a part of the computing units is also used as a polling redundancy unit; the scheduling unit is further configured to input data corresponding to the same thread into the polling redundancy unit for operation under the control of the control unit; the arbitration unit is further configured to determine whether at least three identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit, the output result of the polling redundancy unit, and the output result of the to-be-consolidated calculation unit, or two identical output results exist and the other two output results are different, where the identical output results are output when at least three identical output results exist, or two identical output results exist and the other two output results are different.
With reference to a possible implementation manner of the embodiment of the second aspect, the control unit performs polling reinforcement on the computing units in the computing unit group according to a preset order by using the polling redundancy unit.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a device body and a parallel processor as provided by the embodiment of the second aspect described above and/or any possible implementation manner of the embodiment of the second aspect.
With reference to a possible implementation manner of the embodiment of the third aspect, the electronic device further includes a register, where the register is connected to the parallel processor and is configured to respond to an operation reinforcement level configuration operation input by a user to complete a corresponding configuration operation.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not intended to be to scale as practical, emphasis instead being placed upon illustrating the subject matter of the present application.
Fig. 1 shows a schematic diagram of a conventional parallel processor.
Fig. 2 is a schematic diagram illustrating an operation manner of a MAC in a typical SIMD architecture according to an embodiment of the present application.
FIG. 3 shows a schematic diagram of a SIMD architecture provided by an embodiment of the present application.
Fig. 4 shows a polling hardening schematic diagram when the operation hardening level provided by the embodiment of the present application is harden_reg = 1.
Fig. 5 shows a polling hardening schematic diagram when the operation hardening level provided by the embodiment of the present application is harden_reg = 2.
Fig. 6 shows a polling hardening schematic diagram when the operation hardening level provided by the embodiment of the present application is harden_reg = 5.
Fig. 7 shows a polling hardening schematic diagram when the operation hardening level provided by the embodiment of the present application is harden_reg = 6.
Fig. 8 shows a flowchart of a computing unit operation reinforcing method provided in an embodiment of the present application.
Fig. 9 is a schematic flowchart illustrating a further calculation unit operation strengthening method according to an embodiment of the present application.
Fig. 10 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application is only one kind of association relationship describing the associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone.
In order to solve the problems of long detection time, complex processing logic and low accuracy in currently used error-correction and reinforcement methods, an embodiment of the present application provides a computing unit operation reinforcement method that reinforces the computing units through a redundancy design. In one implementation manner, a part of the original plurality of computing units is used as redundant units to form at least one redundant unit group (each redundant unit group includes a first redundant unit and a second redundant unit), and the remaining part forms the computing unit group. By adding a certain amount of hardware, and without changing the original basic circuit, the operation reinforcement level can be adjusted according to different application requirements, and the size of the thread batch in an instruction block (wave) and the allocation of computing resources in a Single Instruction Multiple Data (SIMD) architecture are adjusted adaptively, so that reinforcement of the computing units is realized using the computing units' own resources. The operation reinforcement level can be adjusted dynamically, which suits different reinforcement requirements in different working environments. It should be noted that, instead of using part of the original plurality of computing units as redundant units, additional computing units may be added to serve as the redundant units. Since the principle of the two approaches is the same, the embodiments of the present application are described only for the case where part of the original computing units serve as redundant units, but this should not be construed as limiting the application.
To facilitate understanding of the computing unit operation reinforcement method provided in the embodiments of the present application, the architecture of the parallel processor shown in Fig. 1 is taken as an example below. The parallel processor includes 4 mutually independent Shader Engines (SE), namely SE0, SE1, SE2 and SE3. Each SE includes 16 Computing Units (CU), namely CU0, CU1, ..., CU14 and CU15. Each CU contains 4 Single Instruction Multiple Data (SIMD) units, namely SIMD0, SIMD1, SIMD2 and SIMD3. Each SIMD contains 16 MACs; each MAC performs a basic multiply-accumulate operation and is the most basic computing unit. Each SE corresponds to a Work Group, each SIMD corresponds to an instruction block (wave), one wave corresponds to 64 threads, the operation of a whole wave is completed in 4 stages (phases), and each MAC processes one thread per stage.
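A purely illustrative tally of this example hierarchy (the constants simply restate the numbers above):

```python
# Illustrative tally of the example hierarchy described above.
SHADER_ENGINES = 4        # SE0..SE3
CUS_PER_SE = 16           # CU0..CU15
SIMDS_PER_CU = 4          # SIMD0..SIMD3
MACS_PER_SIMD = 16        # one MAC handles one thread per stage

THREADS_PER_WAVE = 64
PHASES_PER_WAVE = THREADS_PER_WAVE // MACS_PER_SIMD   # 4 stages of 16 threads

total_macs = SHADER_ENGINES * CUS_PER_SE * SIMDS_PER_CU * MACS_PER_SIMD
print(total_macs, PHASES_PER_WAVE)                    # 4096 MACs in total, 4 stages per wave
```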
Fig. 2 is a schematic diagram of how the MACs operate in a typical SIMD architecture; functions such as data reading are omitted to simplify the discussion. Wave information is input into a front-end scheduling unit (thread_dispatch) for allocation; on each allocation, 16 threads of the wave are input in parallel into the 16 MACs for operation, so the whole wave is completed in 4 stages (phases), with 16 threads input per stage. The operation results reach Wave_OUT (WO), which performs first-in first-out buffering and final output; Wave_OUT completes the selection and output of the MAC computation data. In the embodiment of the present application, since the SIMD's own computing resources (MACs) are used to implement operation reinforcement, some additional hardware modules are needed: a control unit (control), arbitration units (arb), a thread cache unit (thread_fifo), a thread reservation unit (thread_skip) and a data cache unit (date_fifo) are added to each SIMD architecture to implement the operation reinforcement logic; a schematic diagram of the modified SIMD architecture is shown in Fig. 3. Note that if redundant units are provided by adding extra computing units rather than reusing the original ones, the thread cache unit (thread_fifo) and data cache unit (date_fifo) in Fig. 3 can be omitted. To adapt to different reinforcement requirements, a multi-bit register (harden_reg) can be used to dynamically adjust the operation reinforcement level (in some fixed scenarios where flexible adjustment is not needed, the register need not be configured). The register represents the level of redundant computation: harden_reg = 0 means no redundant computation, and all computing resources are used for computation; harden_reg = 1 represents the lowest-level reinforcement (first-level reinforcement), in which 2 of the original 16 computing units are used as redundant units to form one redundant unit group, the remaining 14 computing units form the computing unit group, and the 14 computing units are reinforced in turn by polling with the redundant unit group; harden_reg = 2 represents the next level of reinforcement (second-level reinforcement), corresponding to 12 computing units (the computing unit group then includes 12 computing units) and 4 redundant units (i.e., 2 redundant unit groups); and so on. The MACs fixed as computing units and as redundant units for each level are shown in Table 1 below.
TABLE 1
(Table 1 is rendered as an image in the original publication; it lists, for each harden_reg value, which MACs serve as computing units and which serve as redundant units.)
In this embodiment, depending on the operation reinforcement level, a part of the 16 computing units may be used as redundant units to form at least one redundant unit group, while the remaining part forms the computing unit group; each redundant unit group includes a first redundant unit and a second redundant unit. Any leftover unit (for example, at the highest reinforcement level each computing unit in the computing unit group corresponds to one redundant unit group, and apart from the units in the computing unit group and the redundant unit groups one MAC remains out of the 16) is used as a polling redundant unit that reinforces each computing unit by polling. For the highest-level reinforcement, there are 5 computing units (the computing unit group then includes 5 computing units) and 5 redundant unit groups, and the remaining one unit (the polling redundant unit) reinforces each computing unit in a polling manner. Note that the default initial number of computing units (MACs) is taken here as 16, but it is not limited to 16. Note also that the 16 MACs are equivalent, i.e., their thread-processing functions are the same, and which of the 16 MACs are configured as redundant units is set through registers, so the example shown in Table 1 should not be understood as limiting the present application.
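The counts implied by the levels discussed above can be summarized in a small sketch (the concrete MAC index assignment of Table 1 is not reproduced; the function name and the mapping for levels not spelled out in the text are assumptions based on the pattern of two redundant MACs per level):

```python
def harden_allocation(harden_reg, total_macs=16):
    """Counts implied by the description for each reinforcement level; returns
    (computing_units, redundant_unit_groups, polling_redundant_units).
    The concrete MAC index assignment of Table 1 is not reproduced here."""
    if harden_reg == 0:
        return total_macs, 0, 0          # no reinforcement: all MACs compute
    if 1 <= harden_reg <= 5:
        groups = harden_reg              # each redundant unit group uses 2 MACs
        return total_macs - 2 * groups, groups, 0
    if harden_reg == 6:
        return 5, 5, 1                   # highest level: one leftover MAC polls
    raise ValueError("unsupported reinforcement level")
```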
Since the number of computing units changes with the operation reinforcement level, the number of input threads must be adjusted adaptively. A control unit (control) is therefore added to perform this adaptive adjustment, and a thread cache unit (thread_fifo) is added at the input of the scheduling unit (thread_dispatch) to cache the threads of the incoming wave information so that they can be redistributed to the corresponding MACs. An arbitration unit (arb) is added to arbitrate between the output result of the computing unit to be reinforced and the output results of the corresponding redundant unit group and to decide the correct computation result; and a data cache unit (date_fifo) is added to buffer the computation results and reassemble them for output.
WI is the receiving end for the input wave; it caches the input threads and redistributes them under the control of the control unit. Since each wave corresponds to a fixed 64 threads (some threads may be marked as disabled, so the number of valid threads may be less than 64), one wave normally needs 4 clk (clock cycles) to transmit its threads, 16 threads per clk. Under different reinforcement levels the number of computing units available per cycle changes, so the input and output rates no longer match; a thread cache unit (thread_fifo) is therefore added inside WI to match the rates, and threads are read out of thread_fifo for computation according to the reinforcement level. Taking harden_reg = 1 as an example, there are 14 computing units and 2 redundant units, fixed as MAC1 and MAC2. The Shader Sequencer (SQ) in a CU inputs threads into the SIMD; WI receives the threads from SQ and stores them directly into the thread cache unit. Under the control of the control unit, the back-end scheduling unit reads the first 14 threads (because harden_reg = 1) and distributes them to MAC0-MAC15, with MAC1 and MAC2 serving as redundant units. As the computation proceeds, the scheduling unit distributes the thread data of the computing unit currently being reinforced to MAC1 and MAC2, reinforcing different MAC units in different time periods in a polling manner, while the remaining MACs receive threads in order; the specific correspondence is shown in Table 2.
TABLE 2
(Table 2 is rendered as an image in the original publication; it gives the clk-by-clk correspondence between threads and MACs for the harden_reg = 1 example.)
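The distribution tabulated in Table 2 can be sketched roughly as follows (MAC indices follow the harden_reg = 1 example above; the helper name and argument layout are illustrative):

```python
def dispatch_harden1(thread_batch, hardened_mac, redundant_macs=(1, 2), total_macs=16):
    """Distribute one batch of 14 threads for the harden_reg = 1 example:
    MAC1 and MAC2 form the redundant pair, the remaining 14 MACs each receive
    one thread, and the thread of the MAC currently being reinforced is also
    copied to the redundant pair."""
    compute_macs = [m for m in range(total_macs) if m not in redundant_macs]
    assert len(thread_batch) == len(compute_macs)
    assert hardened_mac in compute_macs

    assignment = dict(zip(compute_macs, thread_batch))
    for red in redundant_macs:
        assignment[red] = assignment[hardened_mac]   # redundant computation
    return assignment  # mapping: MAC index -> thread data
```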
After clk5, the threads of the next wave are input and the polling allocation continues. When arbitration fails in any of arb0-arb15, the most recently input threads must be recomputed, so the threads allocated by the scheduling unit (thread_dispatch) need to be stored. As one implementation, a thread_skip unit may be added to WI; during normal operation this module receives and stores the output of the scheduling unit (thread_dispatch). When arbitration fails and a stall occurs, the thread_skip unit provides the data source for the scheduling unit (thread_dispatch) and suspends storing the scheduler's output, so that the 16 MACs at the back end receive exactly the same inputs as in the previous operation. An arb_ctrl_stall signal (1 bit) can be used to indicate whether the most recently input threads need to be recomputed: during normal operation the signal stays at 0; whenever an arbitration failure occurs in any of arb0-arb15, the signal goes high, and the scheduling unit then obtains the threads from the thread_skip unit for recomputation. The signal is also output externally to stop the input of waves to the SIMD. Of course, in one embodiment the thread_skip unit may be omitted and its function implemented by the thread cache unit, i.e., after the scheduling unit has obtained a number of threads equal to the number of computing units from the thread cache unit, the thread cache unit continues to hold the threads just read.
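A rough sketch of the replay behaviour just described (class and method names are illustrative; only the stall/replay logic is modelled):

```python
class ThreadReplayBuffer:
    """Sketch of the thread_skip behaviour described above: it mirrors what the
    scheduling unit dispatched and replays it when an arbitration failure raises
    the stall signal, so that all MACs see exactly the same inputs again."""
    def __init__(self):
        self.last_dispatch = None
        self.stall = False            # corresponds to the 1-bit arb_ctrl_stall

    def record(self, dispatched_threads):
        if not self.stall:            # while stalled, keep the saved inputs frozen
            self.last_dispatch = dispatched_threads

    def on_arbitration(self, success):
        self.stall = not success      # any arbitration failure raises the stall

    def source_for_next_cycle(self, new_threads):
        # On a stall the scheduler replays the stored threads instead of new ones.
        return self.last_dispatch if self.stall else new_threads
```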
For WI, the control unit exercises control through the signals arb_ctrl_wi_compute, arb_ctrl_wi_check_thread and arb_ctrl_stall. arb_ctrl_wi_compute calibrates the number of threads read out of the thread cache unit at a time. arb_ctrl_wi_check_thread is a 16-bit signal used to mark the index of the thread currently being reinforced; the number of computing units under the current configuration is fixed and determines the positions of the redundant units, threads are distributed to the computing units in order, and the final data distribution can only be determined once the position of the reinforced thread is marked. Taking harden_reg = 2 as an example, the MACs used for computation and for redundancy can be found from Table 1; clearly arb_ctrl_wi_compute = 12, so in this configuration the scheduling unit reads 12 threads from the thread cache unit (thread_fifo) at a time. At clk1, arb_ctrl_wi_check_thread = 16'h0003, indicating that thread0 and thread1 need to be reinforced; at clk2, arb_ctrl_wi_check_thread = 16'h0006, indicating that thread1 and thread2 need to be reinforced; at clk3, arb_ctrl_wi_check_thread = 16'h000C, indicating that thread2 and thread3 need to be reinforced; and so on.
The control unit (control) is the core of the newly added function. It determines, from the plurality of (e.g., 16) computing units, the number of computing units corresponding to the operation reinforcement level configured in the static register (harden_reg), thereby forming the computing unit group (since the total number of computing units is fixed, once the number of computing units for the level is determined, the remaining units are the redundant units), and it dynamically adjusts the number of input threads accordingly. Taking harden_reg = 2 as an example, Table 1 shows that the 16 MACs then correspond to 12 computing units and 4 redundant units, so the control unit instructs the scheduling unit to select 12 threads from the thread cache unit for allocation, one thread per computing unit, and to input the data of the same thread into the computing unit to be reinforced and into the first redundant unit and the second redundant unit of the corresponding redundant unit group, respectively, for operation, the computing unit to be reinforced being a computing unit in the computing unit group. The control unit also decides which of the 16 arbitration units must arbitrate, which are invalid and which simply pass their output through; when an arbitration result fed back by an arbitration unit indicates failure, it outputs a stall signal to suspend the input of subsequent threads, traces back to the time point whose threads must be recomputed, and uses the threads cached in the thread_skip unit to reinforce the computing unit to be reinforced again. If arbitration fails a second time, for example if the latest output result of the first redundant unit, the latest output result of the second redundant unit and the latest output result of the computing unit to be reinforced are all different, it reports an error. The control unit further controls the data cache unit, for example instructing it to cache the output results of the computing units in the computing unit group other than the unit being reinforced together with the agreed output result, and to output a preset number of results in first-in first-out order once that number (equal to the total number of MACs, e.g., 16) has been buffered. In addition, during reinforcement the control unit records the polling status in real time, so that it knows which computing unit is being reinforced at the current moment.
The added arbitration units (arb0-arb15) complete arbitration, data pass-through, or marking data as invalid under the control of the control unit (control). Each arbitration unit (arb) is connected to the control unit and to the 16 MACs, and each arbitration unit can receive the outputs of all 16 MACs simultaneously. When no arbitration is needed (data pass-through), the arbitration unit simply selects, from the 16 output results, the result of its own corresponding computing unit and outputs it. When arbitration is needed, it selects its corresponding output results from the 16 results and arbitrates (for example, arbitrating the output result of the computing unit to be reinforced against the output results of the corresponding first and second redundant units, or against the output results of the first redundant unit, the second redundant unit and the polling redundant unit), outputs the correct result, and feeds the arbitration result back to the control unit (control) for further processing. When marked invalid, it outputs no data. Suppose there is only one redundant unit group (MAC1, MAC2). At clk1, if the computing unit to be reinforced is MAC0, the arbitration unit that must arbitrate is arb0: arb0 selects the output results of MAC0, MAC1 and MAC2 from the 16 MACs and judges whether at least 2 of the 3 output results are identical, outputting the identical result if so; the invalid arbitration units are arb1 and arb2, which output no data; arb3-arb15 correspond one-to-one to MAC3-MAC15, so arb3 selects and outputs the result of MAC3 from the 16 output results, arb4 selects and outputs the result of MAC4, ..., and arb15 selects and outputs the result of MAC15. At clk2, if the unit to be reinforced is MAC3, the arbitration unit that must arbitrate is arb3: arb3 selects the output results of MAC3, MAC1 and MAC2 from the 16 MACs for judgment; the invalid arbitration units are arb1 and arb2, which output no data; arb0 and arb4-arb15 correspond one-to-one to MAC0 and MAC4-MAC15; and so on.
The control unit controls each arbitration unit by sending a control signal arb_ctrl_mask (of N bits, where N is the sum of the number of computing units and redundant units; N = 16 if no computing units are added). When three or four bits of arb_ctrl_mask are high, arbitration is required; if only one bit of arb_ctrl_mask is high, no arbitration is required and the arbitration unit acts as a pass-through output; if all bits of arb_ctrl_mask are 0, the unit is invalid. At the same time, the arbitration result is fed back to the control unit and to WO through a signal line arb_stall[15:0]; each bit of arb_stall corresponds to one arb, and only an arb working in arbitration mode sets its bit to 1 when an arbitration failure occurs, otherwise the bit remains 0. For example, when the arb_ctrl_mask for arb0 is 16'h0007, arb0 must arbitrate; its data sources are MAC0, MAC1 and MAC2, and it outputs the correct result. When the arb_ctrl_mask input to arb1 is 16'h0000, that unit need not output anything and its result is marked invalid. When the arb_ctrl_mask input to arb3 is 16'h0008, no arbitration is required and the result of MAC3 is selected for normal output. If the control unit receives arb_stall = 16'h0001 fed back by arb0, the arbitration of arb0 has failed; if arb15 feeds back arb_stall = 16'h8000, the arbitration of arb15 has failed, whereas arb_stall = 16'h0000 indicates that no arbitration failure occurred.
When three bits of arb_ctrl_mask are high, the 3 output results corresponding to those bits must be selected from the 16 output results for arbitration (for example, arb_ctrl_mask = 16'h0007 indicates that MAC0, MAC1 and MAC2 are arbitrated; arb_ctrl_mask = 16'h000E indicates that MAC3, MAC1 and MAC2 are arbitrated; arb_ctrl_mask = 16'h0016 indicates that MAC4, MAC1 and MAC2 are arbitrated; ...; arb_ctrl_mask = 16'h8006 indicates that MAC15, MAC1 and MAC2 are arbitrated), and arbitration succeeds if at least 2 of the 3 output results are identical. When four bits of arb_ctrl_mask are high, the 4 output results corresponding to those bits must be selected from the 16 output results for arbitration (for example, arb_ctrl_mask = 16'h8007 indicates that MAC0, MAC1, MAC2 and MAC15 are arbitrated; arb_ctrl_mask = 16'h8038 indicates that MAC3, MAC4, MAC5 and MAC15 are arbitrated; arb_ctrl_mask = 16'h81C0 indicates that MAC15, MAC6, MAC7 and MAC8 are arbitrated; ...; arb_ctrl_mask = 16'hF000 indicates that MAC15, MAC14, MAC13 and MAC12 are arbitrated), and arbitration succeeds if at least 3 of the 4 output results are identical, or if 2 output results are identical and the other two differ from each other. When arbitration succeeds, the identical output result is output. When arbitration fails (all 3 output results differ, all 4 output results differ, or there are 2 conflicting pairs of identical results), recomputation or error reporting is required.
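A compact sketch of the per-arbiter behaviour driven by arb_ctrl_mask, combining the pass-through, invalid and voting cases described above (the function name is illustrative):

```python
from collections import Counter

def arbitrate_with_mask(arb_ctrl_mask, mac_outputs):
    """mac_outputs lists all 16 MAC results; arb_ctrl_mask selects which of them
    this arbitration unit considers. Returns (result, stall_flag)."""
    selected = [mac_outputs[i] for i in range(len(mac_outputs))
                if (arb_ctrl_mask >> i) & 1]

    if len(selected) == 0:
        return None, False            # invalid: this arbiter outputs nothing
    if len(selected) == 1:
        return selected[0], False     # pass-through, no arbitration needed

    # Three or four selected results: vote as described above.
    counts = Counter(selected).most_common()
    value, count = counts[0]
    if count >= len(selected) - 1:    # 2-of-3, or at least 3-of-4, agree
        return value, False
    if len(selected) == 4 and count == 2 and len(counts) == 3:
        return value, False           # 2 identical, the other two mutually different
    return None, True                 # arbitration failure: raise the stall bit
```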
WO is the output end of the arbitration units arb0-arb15; it receives their output results, aggregates them, and stores them in the data cache unit (date_fifo). When harden_reg is not 0, the number of results input per cycle is necessarily less than 16, while the output port must output 16 results at a time, so a data cache unit is needed to buffer the data; since the scheduling unit in WI allocates threads in a fixed order, the data cache unit in WO only needs to store the results in sequence. When 16 or more results are stored, the data read-out unit (date_pop) reads out 16 groups of data for output.
For ease of understanding, a specific example is described below with reference to Fig. 4. Fig. 4 shows the polling hardening case when the operation hardening level is harden_reg = 1: there are 14 computing units and one redundant unit group, and the 14 computing units are polled by that group. The process is as follows. The control unit obtains the operation reinforcement level from the register and then determines the corresponding number of computing units from the 16 MACs to form the computing unit group, here 14 computing units and 2 redundant units (assumed to be MAC1 and MAC2). At clk1, the control unit sends arb_ctrl_wi_compute = 14 and arb_ctrl_wi_check_thread = 16'h0001 to the scheduling unit, so that the scheduling unit selects 14 threads from the thread cache unit, inputs the data of the selected 14 threads to the computing units of the computing unit group for operation, and inputs the data of the thread assigned to the computing unit to be reinforced (MAC0) to the first redundant unit (MAC1) and the second redundant unit (MAC2) of the corresponding redundant unit group. The control unit sends arb_ctrl_mask = 16'h0007 to arb0, so that arb0 selects the output results of MAC0, MAC1 and MAC2 from the 16 output results and arbitrates, i.e., judges whether at least two of the 3 output results are identical (if arbitration fails, arb0 feeds arb_stall = 16'h0001 back to the control unit, otherwise it feeds back arb_stall = 16'h0000). The control unit sends arb_ctrl_mask = 16'h0000 to arb1 and arb2, indicating that those units need not output. It sends arb_ctrl_mask signals with only one bit high to arb3-arb15, so that each of them selects only the output of the MAC corresponding to its high bit: for example, arb_ctrl_mask = 16'h0008 is sent to arb3, arb_ctrl_mask = 16'h0010 to arb4, ..., and arb_ctrl_mask = 16'h8000 to arb15. The data cache unit in WO buffers the 14 results received.
At clk2, the control unit sends arb_ctrl_wi_compute = 14 and arb_ctrl_wi_check_thread = 16'h0002 to the scheduling unit, so that the scheduling unit selects 14 threads from the thread cache unit, inputs the data of the selected 14 threads to the computing units of the computing unit group for operation, and inputs the data of the thread assigned to the computing unit to be reinforced (MAC3) to the first redundant unit (MAC1) and the second redundant unit (MAC2) of the corresponding redundant unit group. It sends arb_ctrl_mask = 16'h000E to arb3, so that arb3 selects the output results of MAC3, MAC1 and MAC2 from the 16 output results and arbitrates (if arbitration fails, arb3 feeds arb_stall = 16'h0008 back to the control unit, otherwise it feeds back arb_stall = 16'h0000). The control unit sends arb_ctrl_mask = 16'h0000 to arb1 and arb2, indicating that those units need not output, and sends arb_ctrl_mask signals with only one bit high to arb0 and arb4-arb15, so that each selects only the output of its corresponding MAC: for example, arb_ctrl_mask = 16'h0001 is sent to arb0, arb_ctrl_mask = 16'h0010 to arb4, ..., and arb_ctrl_mask = 16'h8000 to arb15. The data cache unit in WO buffers the 14 results received; since the buffered data now amounts to 28 results, WO outputs 16 of them in distribution order and the remaining 12 stay buffered. By analogy, at clk16 the control unit sends arb_ctrl_wi_compute = 14 and arb_ctrl_wi_check_thread = 16'h2000 to the scheduling unit, so that the scheduling unit selects 14 threads from the thread cache unit, inputs the data of the selected 14 threads to the computing units of the computing unit group for operation, and inputs the data of the thread assigned to the computing unit to be reinforced (MAC15) to the first redundant unit (MAC1) and the second redundant unit (MAC2) of the corresponding redundant unit group. It sends arb_ctrl_mask = 16'h8006 to arb15, so that arb15 selects the output results of MAC15, MAC1 and MAC2 from the 16 output results and arbitrates, i.e., judges whether at least two of the 3 output results are identical (if arbitration fails, arb15 feeds arb_stall = 16'h8000 back to the control unit, otherwise it feeds back arb_stall = 16'h0000). It sends arb_ctrl_mask = 16'h0000 to arb1 and arb2, indicating that those units need not output, and sends arb_ctrl_mask signals with only one bit high to arb0 and arb3-arb14, so that each selects only the output of its corresponding MAC: for example, arb_ctrl_mask = 16'h0001 is sent to arb0, arb_ctrl_mask = 16'h0010 to arb4, ..., and arb_ctrl_mask = 16'h4000 to arb14.
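The polling sequence walked through above can also be sketched as a short generator (step numbering and names are illustrative; only the check_thread and arb_ctrl_mask values for the reinforced MAC are produced):

```python
def polling_schedule_harden1(total_macs=16, redundant_macs=(1, 2)):
    """For the harden_reg = 1 walk-through above, each polling step reinforces
    one of the 14 computing MACs in turn. Yields
    (step, reinforced_mac, arb_ctrl_wi_check_thread, arb_ctrl_mask)."""
    compute_macs = [m for m in range(total_macs) if m not in redundant_macs]
    redundant_bits = sum(1 << m for m in redundant_macs)

    for step, mac in enumerate(compute_macs, start=1):
        check_thread = 1 << compute_macs.index(mac)   # one-hot: which thread is reinforced
        arb_ctrl_mask = (1 << mac) | redundant_bits   # mask for the arbiter of this MAC
        yield step, mac, check_thread, arb_ctrl_mask

# First steps: MAC0 with check_thread 16'h0001 and mask 16'h0007,
# then MAC3 with check_thread 16'h0002 and mask 16'h000E, matching the example above.
```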
Fig. 5 shows the polling hardening case when the operation hardening level is harden_reg = 2. Besides MAC1 and MAC2, MAC4 and MAC5 are also redundant units, and a whole wave (64 threads) needs 6 clks to complete. During polling reinforcement, the two redundant unit groups poll as a whole: if the reinforcement of 2 of the 12 computing units by the 2 redundant unit groups at the current moment is as shown in part (a) of Fig. 5, then at the next moment the position of the 2 redundant unit groups as a whole is as shown in part (b) of Fig. 5, and so on. The principle of reinforcing the computing unit to be reinforced with each redundant unit group is similar to before, and the specific control flow may refer to the flow shown for Fig. 4, which is not repeated here.
Fig. 6 shows the polling reinforcement case when the operation reinforcement level is Harden_reg = 5; there are 5 groups of redundant units and 6 computing units. This configuration basically covers all possible false-triggering conditions. The 5 groups of redundant units are polled as a whole: if the reinforcement of 5 of the 6 computing units by the 5 groups of redundant units at the current moment is as shown in part a of fig. 6, then at the next moment the polling situation of the 5 groups of redundant units as a whole is as shown in part b of fig. 6. A single computing unit is therefore in a reinforced state 5/6 of the time, and the corresponding computation time for 64 threads becomes 12 clk. The principle of reinforcing the computing unit to be reinforced with each group of redundant units is similar, and the specific control flow may refer to the control flow shown in fig. 4, which is not repeated here.
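As a hedged illustration of the group polling described for figs. 5 and 6 (the exact assignment drawn in the figures is not reproduced here, so a simple rotation is assumed), the schedule can be sketched as follows; polling_schedule is an illustrative name, not a signal from the patent.

def polling_schedule(num_compute, num_groups, clk):
    """Map redundant-group index -> index of the computing unit it reinforces at clock `clk` (assumed rotation)."""
    return {g: (g + clk) % num_compute for g in range(num_groups)}

# Harden_reg = 2: 2 redundant groups, 12 computing units -> 2 of the 12 units reinforced per clk.
assert polling_schedule(num_compute=12, num_groups=2, clk=0) == {0: 0, 1: 1}
assert polling_schedule(num_compute=12, num_groups=2, clk=1) == {0: 1, 1: 2}

# Harden_reg = 5: 5 redundant groups, 6 computing units -> each unit is reinforced 5/6 of the time.
assert polling_schedule(num_compute=6, num_groups=5, clk=0) == {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}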
Fig. 7 shows the polling reinforcement case when the operation reinforcement level is Harden_reg = 6; there are 5 groups of redundant units, 5 computing units, and 1 polling redundancy unit. Only the polling redundancy unit polls the computing units, so a single computing unit is in a reinforced state at every moment and is additionally polled 1/5 of the time; the degree of reinforcement is therefore higher, and the time to complete the 64-thread computation becomes 13 clk. In this embodiment, the process of the control unit sending control signals to the scheduling unit may refer to the control flow shown in fig. 4 and is not repeated here; the process of the control unit sending control signals to the arbitration units is described below with reference to fig. 7. At clk1, the polling redundancy unit MAC15 reinforces MAC0. The control unit sends arb_ctrl_mask = 16'h8007 to arb0, so that arb0 selects the output results of MAC15, MAC0, MAC1, and MAC2 from the 16 output results and arbitrates them, that is, it determines whether at least three identical output results exist, or whether two identical output results exist and the other two output results are different from each other; if arbitration succeeds, arb0 feeds arb_stall = 16'h0001 back to the control unit, otherwise it feeds back arb_stall = 16'h0000. It sends arb_ctrl_mask = 16'h0038 to arb3, so that arb3 selects the output results of MAC3, MAC4, and MAC5 from the 16 output results and arbitrates them; if arbitration succeeds, arb3 feeds arb_stall = 16'h0008 back to the control unit, otherwise it feeds back arb_stall = 16'h0000. It sends arb_ctrl_mask = 16'h01C0 to arb6, so that arb6 selects the output results of MAC6, MAC7, and MAC8 from the 16 output results and arbitrates them; if arbitration succeeds, arb6 feeds arb_stall = 16'h0040 back to the control unit, otherwise it feeds back arb_stall = 16'h0000. It sends arb_ctrl_mask = 16'h0E00 to arb9, so that arb9 selects the output results of MAC9, MAC10, and MAC11 from the 16 output results and arbitrates them; if arbitration succeeds, arb9 feeds arb_stall = 16'h0200 back to the control unit, otherwise it feeds back arb_stall = 16'h0000. It sends arb_ctrl_mask = 16'h7000 to arb12, so that arb12 selects the output results of MAC12, MAC13, and MAC14 from the 16 output results and arbitrates them; if arbitration succeeds, arb12 feeds arb_stall = 16'h1000 back to the control unit, otherwise it feeds back arb_stall = 16'h0000. A signal of arb_ctrl_mask = 16'h0000, indicating that no output is needed from that unit, is sent to arb1, arb2, arb4, arb5, arb7, arb8, arb10, arb11, arb13, arb14, and arb15.
At clk2, the polling redundancy unit MAC15 reinforces MAC3. The control unit sends arb_ctrl_mask = 16'h0007 to arb0, so that arb0 selects the output results of MAC0, MAC1, and MAC2 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h8038 to arb3, so that arb3 selects the output results of MAC15, MAC3, MAC4, and MAC5 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h01C0 to arb6, so that arb6 selects the output results of MAC6, MAC7, and MAC8 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h0E00 to arb9, so that arb9 selects the output results of MAC9, MAC10, and MAC11 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h7000 to arb12, so that arb12 selects the output results of MAC12, MAC13, and MAC14 from the 16 output results and arbitrates them; and a signal of arb_ctrl_mask = 16'h0000, indicating that no output is needed from that unit, is sent to arb1, arb2, arb4, arb5, arb7, arb8, arb10, arb11, arb13, arb14, and arb15. By analogy, at clk5 the polling redundancy unit MAC15 reinforces MAC12. The control unit sends arb_ctrl_mask = 16'h0007 to arb0, so that arb0 selects the output results of MAC0, MAC1, and MAC2 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h0038 to arb3, so that arb3 selects the output results of MAC3, MAC4, and MAC5 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h01C0 to arb6, so that arb6 selects the output results of MAC6, MAC7, and MAC8 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'h0E00 to arb9, so that arb9 selects the output results of MAC9, MAC10, and MAC11 from the 16 output results and arbitrates them; it sends arb_ctrl_mask = 16'hF000 to arb12, so that arb12 selects the output results of MAC12, MAC13, MAC14, and MAC15 from the 16 output results and arbitrates them; and a signal of arb_ctrl_mask = 16'h0000, indicating that no output is needed from that unit, is sent to arb1, arb2, arb4, arb5, arb7, arb8, arb10, arb11, arb13, arb14, and arb15.
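The per-clock mask values quoted for fig. 7 follow a regular pattern that can be summarised, purely for illustration, by the following sketch; GROUPS, POLL_ORDER, and masks_for_clk are names assumed here, and the polling order MAC0, MAC3, MAC6, MAC9, MAC12 is taken from the clk1 to clk5 description above.

# Five fixed groups (MAC0-2, MAC3-5, MAC6-8, MAC9-11, MAC12-14) plus MAC15 as the
# polling redundancy unit that visits MAC0, MAC3, MAC6, MAC9, MAC12 in turn.
GROUPS = {0: (1, 2), 3: (4, 5), 6: (7, 8), 9: (10, 11), 12: (13, 14)}
POLL_UNIT = 15
POLL_ORDER = [0, 3, 6, 9, 12]            # MAC polled at clk1, clk2, ...

def masks_for_clk(clk):
    """arb index -> arb_ctrl_mask for the given clock (clk counted from 1); illustrative only."""
    polled = POLL_ORDER[(clk - 1) % len(POLL_ORDER)]
    masks = {i: 0x0000 for i in range(16)}          # default: no output needed from this arb
    for unit, (r1, r2) in GROUPS.items():
        masks[unit] = (1 << unit) | (1 << r1) | (1 << r2)
        if unit == polled:
            masks[unit] |= 1 << POLL_UNIT           # add MAC15 to this vote
    return masks

m1 = masks_for_clk(1)
assert m1[0] == 0x8007 and m1[3] == 0x0038 and m1[12] == 0x7000
m2 = masks_for_clk(2)
assert m2[0] == 0x0007 and m2[3] == 0x8038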
In figs. 4, 5, 6, and 7, the MACs in gray boxes are redundancy units and the MACs in white boxes are computing units.
Having introduced the functions performed by the modules shown in fig. 3 when the computing units in the computing unit group are operation-reinforced, the computing unit operation reinforcement method applied to the parallel processor according to the embodiment of the present application is described below with reference to fig. 8.
Step S101: respectively inputting data corresponding to the same thread into a computing unit to be reinforced and into the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced for operation.
When a computing unit to be reinforced in the computing unit group is reinforced, the data corresponding to the same thread that is input into the computing unit to be reinforced is also input, respectively, into the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced for operation; the computing unit to be reinforced is a computing unit in the computing unit group. In the corresponding device, the scheduling unit, under the control of the control unit, inputs the data corresponding to the same thread into the computing unit to be reinforced and into the first redundant unit and the second redundant unit in the corresponding redundant unit group, respectively, for operation.
To adapt to different reinforcement requirements, a multi-bit register (Harden_reg) can be used to dynamically adjust the operation reinforcement level. In this embodiment, before controlling the scheduling unit to input the data corresponding to the same thread into the computing unit to be reinforced and into the first redundant unit and the second redundant unit in the corresponding redundant unit group for operation, the control unit is further configured to determine, from the plurality of computing units and according to the operation reinforcement level obtained from the register, the number of computing units corresponding to that level, so as to form the computing unit group. Correspondingly, the scheduling unit is configured to select, under the control of the control unit, as many threads as there are computing units in the computing unit group, input the data corresponding to the selected threads into the computing units in the computing unit group for operation, and input the data corresponding to the thread input into the computing unit to be reinforced into the first redundant unit and the second redundant unit in the corresponding redundant unit group, respectively, for operation, where one computing unit corresponds to one thread.
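Purely as an illustration of how the register value partitions the 16 MACs, the configurations explicitly described in this text can be tabulated as below; the fig. 4 case is assumed here to correspond to level 1, levels not mentioned in this section are omitted, and the names HARDEN_CONFIG and partition are illustrative.

HARDEN_CONFIG = {
    # level: (redundant groups, computing units, has polling redundancy unit)
    1: (1, 14, False),   # MAC1/MAC2 redundant, 14 computing units (fig. 4, level assumed)
    2: (2, 12, False),   # MAC1/MAC2 and MAC4/MAC5 redundant, 12 computing units (fig. 5)
    5: (5, 6, False),    # 5 redundant groups, 6 computing units (fig. 6)
    6: (5, 5, True),     # 5 redundant groups, 5 computing units, MAC15 polls (fig. 7)
}

def partition(harden_reg):
    """Return the partition of the 16 MACs implied by the configured reinforcement level."""
    groups, compute, has_poll = HARDEN_CONFIG[harden_reg]
    return {"redundant_groups": groups,
            "computing_units": compute,
            "polling_redundancy_unit": has_poll}

assert partition(2)["computing_units"] == 12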
Step S102: judging whether at least two identical output results exist among the output result of the first redundancy unit, the output result of the second redundancy unit, and the output result of the computing unit to be reinforced.
It is judged whether at least two of these three output results are identical; if so, step S103 is executed, otherwise the flow returns to step S101. In the corresponding device, for example, the arbitration unit is used, under the control of the control unit, to judge whether at least two identical output results exist among the output result of the first redundancy unit, the output result of the second redundancy unit, and the output result of the computing unit to be reinforced, and to feed the arbitration result back to the control unit.
Step S103: outputting the same output result when there are at least two same output results.
If arbitration succeeds, that is, at least two identical output results exist, the identical output result is output. If arbitration fails, that is, the 3 results are all different from one another, the input of the threads after the moment of the same thread is suspended and the above steps are repeated: the data corresponding to the same thread is again input into the computing unit to be reinforced and into the first redundant unit and the second redundant unit in the corresponding redundant unit group for operation, and it is judged whether at least two identical latest output results exist among the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the computing unit to be reinforced; when at least two identical latest output results exist, the identical latest output result is output. If, after the operation is performed again, the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the computing unit to be reinforced are still all different, an error is reported.
For ease of understanding, the following description is made with reference to table 2. At clk1, 14 threads, Thread0 to Thread13, are input into the 14 computing units MAC0 and MAC3 to MAC15, wherein the data corresponding to Thread0 is input into MAC0, MAC1, and MAC2, respectively, for operation. If the arbitration fed back by arb0 fails, the thread input following clk1 is suspended, and the 14 threads Thread0 to Thread13 are input into MAC0 and MAC3 to MAC15 again for re-operation. If arb0 again fails to arbitrate the latest output results of MAC0, MAC1, and MAC2, an error is reported; if arbitration succeeds, the subsequent flow continues, for example proceeding to clk2. It should be noted that, when the arbitration fed back by arb0 fails, only the data corresponding to Thread0 may be input into MAC0, MAC1, and MAC2 again for re-operation; it is not necessary to input the 13 threads Thread1 to Thread13 into the 13 computing units MAC3 to MAC15 again for operation.
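A hedged sketch of this retry-then-report-error behaviour is given below; the function names are illustrative, and run stands for whatever models one MAC operation in a simulation, not for an interface defined in the patent.

def reinforce_once(run, data):
    """run(unit, data) models one MAC operation; return the result at least two units agree on."""
    for _ in range(2):                              # initial pass plus one retry
        results = [run("unit_to_reinforce", data),
                   run("first_redundancy_unit", data),
                   run("second_redundancy_unit", data)]
        for r in results:
            if results.count(r) >= 2:               # at least two identical output results
                return r
        # arbitration failed: later threads stay suspended and the same thread is re-run
    raise RuntimeError("latest results of all three units still differ: report an error")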
In addition, the method further comprises: caching the output results of the computing units in the computing unit group other than the computing unit to be reinforced together with the identical output result, and outputting a preset number (for example 16) of output results in first-in first-out order when the number of cached output results reaches the preset number. That is, the data caching unit is configured to, under the control of the control unit, cache the output results of the computing units in the computing unit group other than the computing unit to be reinforced together with the identical output result, and to output the preset number of output results in first-in first-out order when the cached output results reach the preset number. For example, at clk1 the data caching unit caches the output results of arb0 and arb3 to arb15; since the cached data has not reached the preset number (for example 16), nothing is output. At clk2 the output results of arb0 and arb3 to arb15 are cached again, the cached data reaches 28, and the first 16 entries are output in order. If arb0 fails to arbitrate at clk1, the arbitration unit also sends an arb_stall signal to WO, so that the data caching unit does not cache the results output at clk1.
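The first-in first-out caching with a release threshold of 16 can be sketched, for illustration only, as follows; ResultBuffer is an assumed name, and the numbers mirror the clk1/clk2 example above.

from collections import deque

class ResultBuffer:
    def __init__(self, threshold=16):
        self.threshold = threshold
        self.fifo = deque()

    def push(self, results):
        """Buffer one clock's worth of arbitrated results."""
        self.fifo.extend(results)

    def pop_if_ready(self):
        """Emit `threshold` results in first-in first-out order once enough have accumulated."""
        if len(self.fifo) >= self.threshold:
            return [self.fifo.popleft() for _ in range(self.threshold)]
        return []

buf = ResultBuffer()
buf.push(range(14))            # clk1: 14 results buffered, nothing emitted yet
assert buf.pop_if_ready() == []
buf.push(range(14, 28))        # clk2: 28 results buffered, the first 16 are emitted
assert buf.pop_if_ready() == list(range(16))   # 12 results remain buffered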
When the computing units in the computing unit group are reinforced, if the number of the redundant unit groups is smaller than that of the computing units in the computing unit group, the computing units in the computing unit group are reinforced in a polling mode according to a preset sequence by the redundant unit groups. When the number of the redundant unit groups is equal to the number of the computing units in the computing unit group, the remaining MAC serves as a polling redundant unit, and polling reinforcement is performed on the computing units in the computing unit group according to a preset sequence.
When the polling redundancy unit is used to perform polling reinforcement on the computing units in the computing unit group in a preset order, the method further comprises: inputting the data corresponding to the same thread into the polling redundancy unit for operation. Correspondingly, step S102 is replaced with: judging whether at least three identical output results exist among the output result of the first redundancy unit, the output result of the second redundancy unit, the output result of the polling redundancy unit, and the output result of the computing unit to be reinforced, or whether two identical output results exist and the other two output results are different from each other; when at least three identical output results exist, or two identical output results exist and the other two output results are different from each other, the identical output result is output. The corresponding flow chart is shown in fig. 9. In the embodiment shown in fig. 9, if arbitration fails, the subsequent thread input is likewise suspended, the data corresponding to the same thread is input again into the computing unit to be reinforced, the polling redundancy unit, and the first redundancy unit and the second redundancy unit in the corresponding redundancy unit group for operation, and it is judged again whether at least three identical latest output results exist among the 4 latest output results, or whether two identical latest output results exist and the other two latest output results are different from each other.
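For illustration, the four-input arbitration rule used when the polling redundancy unit participates (accept when at least three results agree, or when exactly two agree and the remaining two differ from each other) can be sketched as follows; vote_with_polling is an assumed name, not terminology from the patent.

def vote_with_polling(r_main, r_red1, r_red2, r_poll):
    """Return the accepted result under the four-input rule described above, or None on failure."""
    results = [r_main, r_red1, r_red2, r_poll]
    for r in results:
        if results.count(r) >= 3:
            return r                                  # at least three identical results
    for r in results:
        if results.count(r) == 2:
            others = [x for x in results if x != r]
            if len(others) == 2 and others[0] != others[1]:
                return r                              # two identical, the other two differ
    return None                                       # arbitration fails

assert vote_with_polling(7, 7, 7, 9) == 7     # three identical
assert vote_with_polling(7, 7, 3, 9) == 7     # two identical, other two differ
assert vote_with_polling(7, 7, 3, 3) is None  # two equal pairs: ambiguous, fail
assert vote_with_polling(1, 2, 3, 4) is None  # all different: fail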
The implementation principle and technical effects of the computing unit operation reinforcement method provided by the embodiment of the present application are the same as those of the device embodiment described above; for brevity, any part not mentioned in the method embodiment may refer to the corresponding content of the device embodiment described above.
As shown in fig. 10, fig. 10 is a block diagram illustrating a structure of an electronic device 200 according to an embodiment of the present application. The electronic device 200 includes: a transceiver 210, a memory 220, a communication bus 230, a parallel processor 240, and a register 250. The register is used for configuring the operation reinforcement level, so as to adapt to reinforcement requirements of different levels, and is also used for configuring which of the 16 MACs are computing units and which are redundancy units.
The transceiver 210, the memory 220, the parallel processor 240, and the register 250 are electrically connected to one another, directly or indirectly, to achieve data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses 230 or signal lines. The transceiver 210 is used for transmitting and receiving data. The memory 220 is used for storing a computer program, which includes at least one software functional module that may be stored in the memory 220 in the form of software or firmware or embedded in the Operating System (OS) of the electronic device 200. The parallel processor 240 is used for executing the executable modules or computer programs stored in the memory 220. For example, the parallel processor 240 is configured to input data corresponding to the same thread into a computing unit to be reinforced and, respectively, into the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced for operation, wherein the computing unit to be reinforced is a computing unit in the computing unit group; judge whether at least two identical output results exist among the output result of the first redundancy unit, the output result of the second redundancy unit, and the output result of the computing unit to be reinforced; and output the identical output result when at least two identical output results exist.
The Memory 220 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The parallel processor 240 may be an integrated circuit chip having signal processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the parallel processor 240 may be any conventional processor or the like.
The electronic device 200 includes, but is not limited to, a smart phone, a tablet, a computer, a server, and the like.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or an electronic device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A computing unit operation reinforcement method, applied to a parallel processor, wherein the parallel processor comprises a computing unit group and a redundant unit group, and the redundant unit group comprises a first redundancy unit and a second redundancy unit, the method comprising:
respectively inputting data corresponding to the same thread into a computing unit to be reinforced and a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing unit to be reinforced for operation; wherein the computing unit to be reinforced is a computing unit in the computing unit group;
judging whether at least two identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit and the output result of the to-be-reinforced calculation unit;
outputting the same output result when there are at least two same output results.
2. The method according to claim 1, wherein the parallel processor is connected to a register, and before inputting data corresponding to the same thread into the computational unit to be consolidated and the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computational unit to be consolidated for operation, the method further comprises:
according to the operation reinforcement level obtained from the register, determining the number of computing units corresponding to the operation reinforcement level from all computing units of the parallel processor to form the computing unit group;
correspondingly, inputting data corresponding to the same thread into a computing unit to be consolidated and a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing unit to be consolidated respectively for operation, comprising:
selecting threads with the same number as the number of the computing units in the computing unit group, inputting data corresponding to the selected threads into the computing units in the computing unit group for operation, and respectively inputting data corresponding to the threads input into the computing units to be reinforced into a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing units to be reinforced for operation, wherein one computing unit corresponds to one thread.
3. The method of claim 2, further comprising:
and caching the output results of the other computing units except the computing unit to be reinforced in the computing unit group and the same output result, and outputting the output results of the preset number according to a first-in first-out sequence when the cached output results reach the preset number.
4. The method of claim 1, wherein after determining whether there are at least two identical output results among the output result of the first redundant unit, the output result of the second redundant unit, and the output result of the to-be-consolidated computing unit, the method further comprises:
when at least two same output results do not exist, stopping the input of the thread after the moment of the same thread, and inputting the data corresponding to the same thread into the computing unit to be reinforced and the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be reinforced respectively again for operation;
judging whether at least two same latest output results exist in the latest output result of the first redundancy unit, the latest output result of the second redundancy unit and the latest output result of the to-be-reinforced calculation unit;
outputting the same latest output result when there are at least two same latest output results.
5. The method of claim 4, wherein after determining whether there are at least two identical latest output results among the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the to-be-consolidated computing unit, the method further comprises:
and if the latest output result of the first redundancy unit, the latest output result of the second redundancy unit and the latest output result of the to-be-reinforced computing unit are different, an error is reported.
6. The method according to claim 1, wherein the number of the redundant unit groups is smaller than the number of the computing units in the computing unit group, and the computing units in the computing unit group are polled and hardened in a preset order by using the redundant unit groups.
7. The method of claim 1, wherein the number of the redundant unit groups is equal to the number of the computing units in the computing unit group, and the parallel processor further comprises a polling redundancy unit, the method further comprising:
inputting the data corresponding to the same thread into the polling redundancy unit for operation; correspondingly, judging whether at least two identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit and the output result of the to-be-consolidated calculation unit comprises the following steps:
judging whether at least three same output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit, the output result of the polling redundancy unit and the output result of the to-be-consolidated calculation unit, or two same output results exist and the other two output results are different;
when there are at least three identical output results, or there are two identical output results and the other two output results are different, the identical output results are output.
8. The method of claim 7, wherein the polling redundancy unit is used to poll the computing units in the computing unit group in a preset order.
9. A parallel processor comprising a Single Instruction Multiple Data (SIMD) architecture, the SIMD architecture comprising:
a plurality of computing units, wherein a part of the computing units serve as redundant units to form at least one redundant unit group, each redundant unit group comprising a first redundant unit and a second redundant unit, and the remaining computing units form a computing unit group;
a control unit;
the scheduling unit is used for respectively inputting data corresponding to the same thread into a computing unit to be reinforced and a first redundant unit and a second redundant unit in a redundant unit group corresponding to the computing unit to be reinforced under the control of the control unit to carry out operation; wherein the computing unit to be reinforced is a computing unit in the computing unit group;
and the arbitration unit is used for judging whether at least two identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit and the output result of the calculation unit to be consolidated under the control of the control unit, and outputting the identical output results when at least two identical output results exist.
10. The parallel processor according to claim 9, wherein the parallel processor is connected to a register, and the control unit is further configured to, before controlling the scheduling unit to input the data corresponding to the same thread into the to-be-hardened computing unit and the first redundant unit and the second redundant unit in the redundant unit group corresponding to the to-be-hardened computing unit for operation, determine, from the plurality of computing units, a number of computing units corresponding to the operation hardening level according to an operation hardening level obtained from the register, and form the computing unit group, where the number of computing units corresponding to different operation hardening levels is different; correspondingly, the scheduling unit is configured to select, under the control of the control unit, threads of which the number is the same as that of the computing units in the computing unit group, input data corresponding to the selected threads to the computing units in the computing unit group for operation, and input data corresponding to the threads input to the computing unit to be consolidated to the first redundant unit and the second redundant unit in the redundant unit group corresponding to the computing unit to be consolidated respectively for operation, where one computing unit corresponds to one thread.
11. The parallel processor of claim 10, wherein the SIMD architecture further comprises: and the data caching unit is used for caching the output results of the other computing units except the computing unit to be reinforced in the computing unit group and the same output result under the control of the control unit, and outputting the output results of the preset number according to a first-in first-out sequence when the output results to be cached reach the preset number.
12. The parallel processor according to claim 9, wherein the arbitration unit is further configured to feed back the determination result to the control unit, and the control unit is further configured to, when there are no at least two identical output results, suspend input of a thread after a time at which the same thread is located, and control the scheduling unit, and input data corresponding to the same thread again into the to-be-hardened computing unit and the first redundant unit and the second redundant unit in the redundant unit group corresponding to the to-be-hardened computing unit, respectively, to perform operation;
the arbitration unit is further configured to determine whether at least two identical latest output results exist in the latest output result of the first redundancy unit, the latest output result of the second redundancy unit, and the latest output result of the to-be-consolidated calculation unit, and output the identical latest output result when at least two identical latest output results exist.
13. The parallel processor according to claim 12, wherein the control unit is further configured to report an error if the latest output result of the first redundant unit, the latest output result of the second redundant unit, and the latest output result of the to-be-consolidated calculation unit are different.
14. The parallel processor according to claim 9, wherein the number of the redundant unit groups is smaller than the number of the computing units in the computing unit group, and the control unit is further configured to perform polling hardening on the computing units in the computing unit group according to a preset order by using the redundant unit groups.
15. The parallel processor according to claim 9, wherein the number of the redundant unit groups is equal to the number of the computing units in the computing unit group, and in this case a part of the plurality of computing units further serves as a polling redundancy unit;
the scheduling unit is further configured to input data corresponding to the same thread into the polling redundancy unit for operation under the control of the control unit;
the arbitration unit is further configured to determine whether at least three identical output results exist in the output result of the first redundancy unit, the output result of the second redundancy unit, the output result of the polling redundancy unit, and the output result of the to-be-consolidated calculation unit, or two identical output results exist and the other two output results are different, where the identical output results are output when at least three identical output results exist, or two identical output results exist and the other two output results are different.
16. The parallel processor according to claim 15, wherein the control unit performs polling hardening on the computing units in the computing unit group in a preset order by using the polling redundancy unit.
17. An electronic device, comprising: a body and the parallel processor according to any one of claims 9 to 16.
18. The electronic device of claim 17, further comprising: a register connected to the parallel processor and configured to complete the corresponding configuration operation in response to an operation reinforcement level configuration operation input by a user.
CN202010963761.2A 2020-09-14 2020-09-14 Computing unit operation reinforcement method, parallel processor and electronic equipment Active CN112084071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010963761.2A CN112084071B (en) 2020-09-14 2020-09-14 Computing unit operation reinforcement method, parallel processor and electronic equipment


Publications (2)

Publication Number Publication Date
CN112084071A true CN112084071A (en) 2020-12-15
CN112084071B CN112084071B (en) 2023-09-19

Family

ID=73738104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010963761.2A Active CN112084071B (en) 2020-09-14 2020-09-14 Computing unit operation reinforcement method, parallel processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN112084071B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11272568A (en) * 1998-01-07 1999-10-08 Hitachi Ltd Storage reproducer, error correction method, portable information terminal and digital camera using the same
US20050138485A1 (en) * 2003-12-03 2005-06-23 Osecky Benjamin D. Fault-detecting computer system
CN102163250A (en) * 2011-05-10 2011-08-24 电子科技大学 Redundant-residue-number-system-based irradiation-resisting reinforcing method and device
CN102355348A (en) * 2011-06-28 2012-02-15 中国人民解放军国防科学技术大学 Fault-tolerant data encryption standard (DES) algorithm accelerator
CN102355349A (en) * 2011-06-28 2012-02-15 中国人民解放军国防科学技术大学 Fault-tolerant based IDEA (International Data Encryption Algorithm) full-flowing-water hardware encryption method
US20130179748A1 (en) * 2012-01-06 2013-07-11 Via Technologies, Inc. Systems and methods for error checking and correcting for memory module
WO2014207893A1 (en) * 2013-06-28 2014-12-31 株式会社日立製作所 Computation circuit and computer
CN103955448A (en) * 2014-05-21 2014-07-30 西安空间无线电技术研究所 FFT (fast Fourier transform) reinforcing design method with single event upset-resistant capability
US20180024506A1 (en) * 2016-07-21 2018-01-25 Supercomputing Systems Ag Computerised System
CN106487673A (en) * 2016-12-08 2017-03-08 北京时代民芯科技有限公司 A kind of error detection based on triplication redundancy retransmits fault tolerance rout ing unit
JP2020061124A (en) * 2018-10-05 2020-04-16 富士通株式会社 Parallel processor and arithmetic processing method
CN109753267A (en) * 2019-01-14 2019-05-14 深圳市网心科技有限公司 A kind of method, apparatus of redundancy encoding, equipment and computer readable storage medium
CN110989333A (en) * 2019-10-29 2020-04-10 北京星际荣耀空间科技有限公司 Redundancy control method based on multiple computing cores, computing cores and redundancy control system
CN111190774A (en) * 2019-12-26 2020-05-22 北京时代民芯科技有限公司 Configurable dual-mode redundancy structure of multi-core processor

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
AYSE YILMAZER et al.: "Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs", 《2014 IEEE 28TH INTERNATIONAL PARALLEL & DISTRIBUTED PROCESSING SYMPOSIUM》 *
SHANE RYOO et al.: "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA", 《ACM》 *
WENHAO JIA et al.: "MRPB: Memory Request Prioritization for Massively Parallel Processors", 《2014 IEEE 20TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE》 *
LIU Yimeng: "Design and Implementation of a Real-Time Radar Signal Processing System Based on an Embedded Hardened GPGPU Simulation Platform", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
"Coding for Parallel Execution of Hardware-in-the-Loop Millimeter Wave Scene Generation Models on Multi-Core SIMD Processor Architectures", 《SPIE》 *
ZHENG Boxiang et al.: "Design and Implementation of a Multi-Master Parallel Processing Hardened Computer", 《计算机技术与应用》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232523A (en) * 2020-12-08 2021-01-15 湖南航天捷诚电子装备有限责任公司 Domestic artificial intelligence computing equipment

Also Published As

Publication number Publication date
CN112084071B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US6487575B1 (en) Early completion of iterative division
US7284117B1 (en) Processor that predicts floating point instruction latency based on predicted precision
US6490607B1 (en) Shared FP and SIMD 3D multiplier
US20190171941A1 (en) Electronic device, accelerator, and accelerating method applicable to convolutional neural network computation
CN100382061C (en) Method and apparatus for counting interrupts by type
US8639882B2 (en) Methods and apparatus for source operand collector caching
CN112445529A (en) Software assisted power management
US6718403B2 (en) Hierarchical selection of direct and indirect counting events in a performance monitor unit
EP2579164B1 (en) Multiprocessor system, execution control method, execution control program
JP2004537084A (en) Microprocessor employing performance throttling mechanism for power management
CN102640131A (en) Unanimous branch instructions in a parallel thread processor
CN107636630B (en) Interrupt controller
US20100262967A1 (en) Completion Arbitration for More than Two Threads Based on Resource Limitations
JP2005302025A (en) Method, completion table, and processor for tracking a plurality of outstanding instructions
US20140143635A1 (en) Techniques for storing ecc checkbits in a level two cache
US9304775B1 (en) Dispatching of instructions for execution by heterogeneous processing engines
US6185672B1 (en) Method and apparatus for instruction queue compression
CN104272248A (en) Predicate calculation in processor instruction set
US20120204183A1 (en) Associative distribution units for a high flowrate synchronizer/schedule
US6460130B1 (en) Detecting full conditions in a queue
US6247114B1 (en) Rapid selection of oldest eligible entry in a queue
CN112084071B (en) Computing unit operation reinforcement method, parallel processor and electronic equipment
US6873184B1 (en) Circular buffer using grouping for find first function
CN111857831B (en) Memory bank conflict optimization method, parallel processor and electronic equipment
US7080170B1 (en) Circular buffer using age vectors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Huayuan Industrial Zone, Binhai New Area, Tianjin 300450

Applicant after: Haiguang Information Technology Co.,Ltd.

Address before: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Huayuan Industrial Zone, Binhai New Area, Tianjin 300450

Applicant before: HAIGUANG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant