WO2022001439A1 - Computing apparatus, integrated circuit chip, board and computing method - Google Patents


Info

Publication number
WO2022001439A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing circuits
data
processing
circuit
instruction
Prior art date
Application number
PCT/CN2021/094468
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
陶劲桦
刘道福
周聖元
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2022001439A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields

Definitions

  • This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, and a method of performing computing operations using the aforementioned computing device.
  • Existing artificial intelligence operations often involve large amounts of data processing, such as convolution operations and image processing. As the amount of data increases, the computation and storage involved in data operations such as matrix operations grow sharply with the data size.
  • For such operations, a general-purpose processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used.
  • However, general-purpose processors often have high power consumption due to their general-purpose design and high device redundancy, resulting in limited performance.
  • the existing operation processing circuit usually adopts a fixed hardware architecture.
  • When the data scale expands or the data format changes, such a circuit may not only be unable to support a certain type of operation, but its operation performance during the operation may also be severely limited, or the operation may even become impossible.
  • the present disclosure provides a solution that supports multiple types of operations, improves operation efficiency, and saves operation cost and overhead. Specifically, the present disclosure provides the aforementioned solutions in the following aspects.
  • the present disclosure provides a computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array and to perform multi-threaded operations according to the received parsed instruction.
  • the present disclosure provides an integrated circuit chip comprising the computing device of the various embodiments described above and below.
  • the present disclosure provides a board including the aforementioned integrated circuit chip.
  • the present disclosure provides a method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: using the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  • FIG. 1 is an overall architecture diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 2 is an exemplary specific architecture diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 3 is an example block diagram illustrating a single type of processing circuit array of a computing device according to an embodiment of the present disclosure
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device according to an embodiment of the present disclosure
  • FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • FIGS. 7a, 7b, 7c and 7d are schematic diagrams illustrating various loop structures of processing circuits according to embodiments of the present disclosure
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various loop structures of processing circuits according to embodiments of the present disclosure
  • FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure
  • FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by a post-processing circuit according to an embodiment of the present disclosure
  • FIG. 11 is a simplified flowchart illustrating a method of using a computing device to perform an arithmetic operation according to an embodiment of the present disclosure
  • FIG. 12 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • FIG. 1 is a general architectural diagram illustrating a computing device 100 according to an embodiment of the present disclosure.
  • the computing device 100 of the present disclosure may include a control circuit 102 and a plurality of processing circuits 104 .
  • the control circuit may be configured to obtain and parse the instruction, and may send the parsed instruction to one or more of the plurality of processing circuits.
  • the fetched instructions may include one or more opcodes, and each opcode may represent one or more specific operations to be performed by one or more processing circuits.
  • Each opcode can be represented in any suitable form.
  • an opcode can be represented by an English abbreviation such as "ADD” or "MUL” to express that what is to be performed is an "addition” or “multiplication” operation.
  • the operation code can also be represented by an English abbreviation, such as "AM", whose literal meaning does not directly indicate the specific operation.
  • the opcode may include or involve different types of operations, for example arithmetic operations such as addition or multiplication, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing types of operations.
  • each opcode may correspond to one or more microinstructions obtained in the process of parsing the instruction.
  • the parsed instruction of the present disclosure may include one or more micro-instructions corresponding to an opcode in the instruction to indicate one or more specific operations to be performed by the processing circuit.
  • the control circuit 102 may be configured to acquire instruction identification information in the instruction, and to send the parsed instruction to the one or more processing circuits, among the plurality of processing circuits, that are identified by the instruction identification information.
  • the parsed instruction here may be an instruction decoded by the control circuit, or a parsed instruction that has not yet been decoded by the control circuit.
  • a corresponding decoding circuit may be included in the processing circuit to decode the parsed instruction, for example, to obtain a plurality of micro-instructions.
  • in the process of parsing the instruction, the control circuit may be configured to decode the instruction and, according to the decoding result and the operating states of the plurality of processing circuits, send the parsed instruction to one or more of the plurality of processing circuits.
  • in some scenarios, the plurality of processing circuits may all support the same, non-dedicated types of operations. Therefore, in order to improve the utilization and operation efficiency of the processing circuits, the parsed instruction may be sent to processing circuits whose occupancy is low or which are in an idle state.
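  • As an illustration only, the dispatch behavior described above can be sketched in a few lines of Python; the instruction fields, the busy flag and the helper names below are assumptions made for this sketch rather than anything defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParsedInstruction:
    opcode: str                                           # e.g. "ADD", "MUL" or an abbreviation such as "AM"
    micro_instructions: List[str] = field(default_factory=list)
    target_ids: List[int] = field(default_factory=list)   # optional instruction identification information

@dataclass
class ProcessingCircuit:
    pc_id: int
    busy: bool = False                                    # simplified stand-in for the circuit's operating state

def dispatch(parsed: ParsedInstruction, circuits: List[ProcessingCircuit]) -> List[ProcessingCircuit]:
    """Send a parsed instruction either to the circuits named by the instruction
    identification information or, when none are named, to idle circuits."""
    if parsed.target_ids:
        targets = [c for c in circuits if c.pc_id in parsed.target_ids]
    else:
        targets = [c for c in circuits if not c.busy]
    for c in targets:
        c.busy = True                                     # the selected circuits now execute the instruction
    return targets

circuits = [ProcessingCircuit(i) for i in range(4)]
circuits[1].busy = True
print([c.pc_id for c in dispatch(ParsedInstruction("ADD"), circuits)])  # -> [0, 2, 3]
```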
  • the plurality of processing circuits 104 may be configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instructions.
  • the plurality of processing circuits may be configured to receive and execute the parsed instructions in a single instruction multithreading ("SIMT") fashion.
  • the multi-dimensional array may comprise a two-dimensional array and/or a three-dimensional array (as shown in FIGS. 5 and 6 ).
  • each processing circuit in the aforementioned one-dimensional or multi-dimensional array can be connected to other processing circuits in a specified direction and a predetermined spacing pattern within a certain range.
  • multiple processing circuits may be logically connected in series to form one or more closed loops (as shown in Figures 7 and 8).
  • connection mode between the multiple processing circuits may be a hard-wired connection through a hardware structure.
  • connection manner between the multiple processing circuits may also be a logical connection manner configured according to parsed instructions, such as microinstructions.
  • FIG. 2 is a diagram illustrating an example specific architecture of a computing device 200 according to an embodiment of the present disclosure.
  • the computing device 200 not only includes the control circuit 102 and the plurality of processing circuits 104 of the computing device 100 in FIG. 1, but also further shows several circuits included in each processing circuit as well as a number of additional components. Since the functions of the control circuit and the processing circuits have been described in detail above with reference to FIG. 1, they will not be repeated below.
  • the processing circuit 104 may include a logic operation circuit 1041, which may be configured to perform logical operations according to the parsed instruction and the received data when performing the multi-threaded operation, for example performing on the received data logical operations such as AND or NOT, shift operations, or comparison operations.
  • the processing circuit 104 may also include an arithmetic operation circuit 1043, which may be configured to perform arithmetic operations, such as linear operations such as addition, subtraction, or multiplication.
  • the processing circuit 104 may also include a storage circuit 1042 including a data storage circuit and/or a predicate storage circuit, wherein the data storage circuit may be configured to store at least one of the operational data (e.g., pixels) and the intermediate operation results of the processing circuit.
  • the predicate storage circuit may be configured to store, for each of the processing circuits, the serial number of its predicate storage circuit and the predicate information obtained using the parsed instruction.
  • the storage circuit 1042 may be implemented by using a memory such as a register or a static random access memory (“SRAM”) according to actual needs.
  • the predicate storage circuit may include a 1-bit register for storing predicate information.
  • the predicate storage circuit in the processing circuit may include 32 1-bit registers sequentially numbered from 00000 to 11111.
  • the processing circuit can read the predicate information in the register corresponding to the serial number "00101" according to the register serial number "00101" specified in the received parsed instruction.
  • the predicate storage circuit may be configured to update the predicate information according to the parsed instruction.
  • the predicate information may be directly updated according to the configuration information in the parsed instruction, or the configuration information may be acquired according to the configuration information storage address provided in the parsed instruction, so as to update the predicate information.
  • the predicate storage circuit may also update the predicate information according to the comparison result of each of the processing circuits, which is a form of operation result in the context of the present disclosure.
  • the predicate information may be updated by comparing input data received by the processing circuit with data stored in its data storage circuit. When the input data is greater than the stored data, the predicate information of the processing circuit is set to 1; conversely, when the input data is smaller than the stored data, the predicate information is set to 0, or its original value is kept unchanged.
  • each processing circuit may determine, according to the information in the parsed instruction, whether to execute the operation of the parsed instruction. Further, each processing circuit may be configured to obtain the corresponding predicate information according to the serial number of the predicate storage circuit in the parsed instruction, and to determine from that predicate information whether to execute the parsed instruction. For example, when the value of the predicate information read by the processing circuit according to the serial number of the predicate storage circuit specified in the parsed instruction is 1, the processing circuit executes the parsed instruction.
  • For example, executing the parsed instruction may cause the processing circuit to read the data pointed to in the instruction and store the read data into its data storage circuit. Conversely, when the value of the predicate information read by the processing circuit according to the serial number of the predicate storage circuit specified in the parsed instruction is 0, the processing circuit does not execute the parsed instruction.
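  • The predicate mechanism just described can be illustrated with a minimal Python sketch, assuming 32 one-bit predicate registers per processing circuit and a parsed instruction that names the register to consult; the class and field names are illustrative, not taken from the disclosure.

```python
class ProcessingCircuit:
    def __init__(self, pc_id: int):
        self.pc_id = pc_id
        self.predicate_regs = [0] * 32      # 32 one-bit registers, serial numbers 0b00000..0b11111
        self.data_store = []                # simplified data storage circuit

    def update_predicate(self, reg_no: int, input_value: int, stored_value: int) -> None:
        # Comparison-based update: set to 1 when the input is greater than the
        # stored data, otherwise clear it (one of the update policies described above).
        self.predicate_regs[reg_no] = 1 if input_value > stored_value else 0

    def execute(self, parsed_instruction: dict, data) -> bool:
        # The parsed instruction names the predicate register to read, e.g. "00101" -> register 5.
        reg_no = int(parsed_instruction["predicate_reg"], 2)
        if self.predicate_regs[reg_no] == 1:
            if parsed_instruction["op"] == "load":        # predicate set: perform the operation
                self.data_store.append(data)
            return True
        return False                                      # predicate is 0: the instruction is skipped

pc = ProcessingCircuit(0)
pc.update_predicate(0b00101, input_value=9, stored_value=3)   # sets predicate register "00101" to 1
pc.execute({"predicate_reg": "00101", "op": "load"}, data=7)  # the data 7 is stored
```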
  • the computing device 200 of the present disclosure may also include data processing circuitry 106 , which may include at least one of pre-processing circuitry 1061 and post-processing circuitry 1062 .
  • the preprocessing circuit 1061 may be configured to perform a preprocessing operation (described later in conjunction with FIG. 7b ) on the operation data before the processing circuit performs the operation, such as performing a data splicing or data placement operation.
  • the post-processing circuit 1062 may be configured to perform a post-processing operation on the result of the operation after the processing circuit performs the operation, such as performing a data restoration or data compression operation.
  • the computing device 200 may further include a main storage circuit 108, which can not only receive and store data from the control circuit as input data for the processing circuits, but can also be used to store and transfer data exchanged among the processing circuits.
  • the main storage circuit 108 may be further divided into at least one of a main storage module 1081 and a main cache module 1082 according to the storage method or the characteristics of the stored data.
  • the main storage module 1081 may be configured to store data (eg, input pixels) to be performed operations in the processing circuit and operation results (eg, output pixels) after performing operations.
  • the main cache module 1082 may be configured to cache intermediate operation results after operations performed in the plurality of processing circuits.
  • the main storage circuit can not only provide internal storage, but can also support data interaction with storage devices outside the computing device of the present disclosure, for example exchanging data with an external storage device through direct memory access ("DMA").
  • FIG. 3 is an example block diagram illustrating a single type of processing circuit array of a computing device according to an embodiment of the present disclosure.
  • the computing device shown includes not only the control circuit 102, the main storage circuit 108 and the data processing circuit 106 described above, but also a plurality of processing circuits 104 of the same type.
  • The multiple processing circuits of the same type may be arranged through physical connections to form a two-dimensional array.
  • the plurality of processing circuits of the present disclosure may be divided according to type for performing different types of data processing operations.
  • the plurality of processing circuits may be divided into first type processing circuits and second type processing circuits (as shown in FIG. 4).
  • the first type of processing circuit may be adapted to perform at least one of arithmetic operations and logical operations
  • the second type of processing circuit may be adapted to perform at least one of comparison operations and table lookup operations .
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device according to an embodiment of the present disclosure.
  • the computing device shown in FIG. 4 includes a control circuit 102 , a main storage circuit 108 and a plurality of processing circuits 104 of different types.
  • the computing device may also include the data processing circuit 106 shown in FIGS. 2 and 3.
  • the computing device architecture shown in FIG. 4 is similar to that shown in FIGS. 2 and 3, so the technical details of the computing device 200 described in conjunction with FIGS. 2 and 3 are also applicable to the computing device shown in FIG. 4.
  • the plurality of processing circuits of the present disclosure may include, for example, a plurality of first-type processing circuits and a plurality of second-type processing circuits (as shown in the figure, processing circuits with different background colors are of different types).
  • the plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. It can be understood that the arrangement of the two types of processing circuits shown in FIG. 4 is merely exemplary and not limiting, and other arrangements may be conceived by those skilled in the art based on the teachings of the present disclosure.
  • a plurality of processing circuits of the first type may be arranged on the left and right sides of the array, and a plurality of processing circuits of the second type may be arranged in the middle area of the array.
  • a plurality of first type processing circuits may be arranged in the middle area of the array, and a plurality of second type processing circuits may be arranged in the surrounding areas of the array.
  • a plurality of first-type processing circuits and second-type processing circuits may also be interspersed in an array.
  • the types of processing circuits disclosed in the present disclosure may not be limited to the two shown in the figures, but may have more types of processing circuits to implement different types of computing operations.
  • the first-type processing circuits are the processing circuits 104 shown with a light background in the figure
  • M and N are each a positive integer greater than 0.
  • the first type of processing circuit can be used to perform arithmetic operations and logical operations, which may include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, and logical operations such as AND and OR, or any combination of the aforementioned types of operations.
  • the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits (the processing circuits 104 shown with a dark background in the figure).
  • the second type of processing circuit may be used to perform non-linear operations such as comparison operations, table lookup operations or shift operations on the received data.
  • the storage circuits in the first type of processing circuit and in the second type of processing circuit may have different storage sizes and storage modes.
  • the predicate storage circuit in the first type of processing circuit may utilize a plurality of numbered registers to store predicate information.
  • the first-type processing circuit can access the predicate information in the register of the corresponding number according to the register number specified in the received parsed instruction.
  • the second type of processing circuit may store the predicate information in a static random access memory ("SRAM").
  • the second type of processing circuit may determine the storage address of the predicate information in the static random access memory ("SRAM") according to the offset of the predicate information specified in the received parsed instruction, and may perform predetermined read or write operations on the predicate information at that storage address.
  • FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
  • the multiple processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology.
  • the multi-dimensional array may be a two-dimensional array, and each processing circuit located in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with the remaining one or more processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern.
  • the predetermined two-dimensional spacing pattern may be associated with the number of processing circuits spaced in the connection.
  • Figures 5a to 5c exemplarily show the topology of various forms of two-dimensional arrays between a plurality of processing circuits.
  • processing circuits may be connected to form a simple two-dimensional array. Specifically, one processing circuit is used as the center of the two-dimensional array, and one processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array with three rows and three columns. Further, since the processing circuit located at the center of the two-dimensional array is directly connected with the adjacent processing circuits in the previous and next columns of the same row and with the adjacent processing circuits in the previous and next rows of the same column, the number of spaced processing circuits (abbreviated as the "interval number") is 0.
  • in the two-dimensional Torus array, each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and in the preceding and following columns, that is, the interval number for the connections to adjacent processing circuits is always 0.
  • the first processing circuit of each row or column in the two-dimensional Torus array is also connected to the last processing circuit of that row or column, and the interval number of the processing circuits connected end to end in each row or column is 2.
  • the processing circuits with four rows and four columns may also be connected to form a two-dimensional array in which the number of intervals between adjacent processing circuits is 0, and the number of intervals between non-adjacent processing circuits is 1.
  • adjacent processing circuits in the same row or the same column are directly connected, that is, with an interval number of 0, while non-adjacent processing circuits in the same row or the same column are connected with an interval number of 1.
  • the processing circuits in the diagonal direction may also be connected with different interval numbers.
  • a three-dimensional Torus array builds on the two-dimensional Torus array and uses a spacing pattern similar to that between rows and columns for the inter-layer connections. For example, the processing circuits in the same row and the same column of adjacent layers are first directly connected, that is, with an interval number of 0. Next, the processing circuits of the first layer and the last layer in the same column are connected, that is, with an interval number of 2. Finally, a three-dimensional Torus array with four layers, four rows and four columns can be formed.
  • connection relationship of other multi-dimensional arrays of processing circuits can be formed on the basis of two-dimensional arrays by adding new dimensions and increasing the number of processing circuits.
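  • As a purely illustrative sketch of the interval patterns above, the following Python function lists the links of one processing circuit in a two-dimensional Torus array: the adjacent circuits in the same row and column (interval number 0) plus the wrap-around links that join the first and last circuit of each row and column.

```python
def torus_neighbors(r: int, c: int, rows: int, cols: int):
    """Neighbors of the processing circuit at (r, c) in a 2D Torus array:
    adjacent circuits in the same row and column, with the modulo wrap-around
    providing the end-to-end link of each row and column."""
    return {
        ((r - 1) % rows, c),   # previous row (wraps to the last row)
        ((r + 1) % rows, c),   # next row
        (r, (c - 1) % cols),   # previous column (wraps to the last column)
        (r, (c + 1) % cols),   # next column
    }

# In a 4x4 Torus, circuit (0, 0) is also linked to (3, 0) and (0, 3) through the
# wrap-around connections, i.e. the end-to-end links with interval number 2.
print(sorted(torus_neighbors(0, 0, 4, 4)))  # -> [(0, 1), (0, 3), (1, 0), (3, 0)]
```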
  • the solutions of the present disclosure may also configure logical connections to processing circuits by using configuration instructions.
  • the disclosed solution may selectively connect some processing circuits or selectively bypass some processing circuits through configuration instructions, so as to form one or more groups of logically connected processing circuits.
  • a logical connection can also be adjusted according to actual operation requirements (eg, data type conversion).
  • the solutions of the present disclosure can configure the connection of the processing circuits, including, for example, configuring into a matrix or configuring into one or more closed computing loops.
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
  • FIGS. 6a to 6d show still further exemplary connection relationships of multi-dimensional arrays formed by a plurality of processing circuits, in addition to those shown in FIGS. 5a to 5d.
  • the technical details described in conjunction with Figs. 5a to 5d also apply to the content shown in Figs. 6a to 6d.
  • in FIG. 6a, the two-dimensional array of processing circuits includes a central processing circuit located at the center of the array and three processing circuits connected to it in each of the four directions along the same row and the same column. Accordingly, the interval numbers of the connections between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively.
  • in FIG. 6b, the two-dimensional array of processing circuits includes a central processing circuit located at the center of the array, three processing circuits in the two opposite directions of the same row, and processing circuits in the two opposite directions of the same column. Accordingly, the interval numbers between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the interval numbers between the central processing circuit and the processing circuits in the same column are all 0.
  • a multi-dimensional array formed by a plurality of processing circuits may be a three-dimensional array composed of multiple layers, wherein each layer of the three-dimensional array may comprise a two-dimensional array of processing circuits arranged along its row and column directions. Further, each processing circuit located in the three-dimensional array may be connected, in a predetermined three-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or on different layers. Further, the predetermined three-dimensional spacing pattern and the number of mutually spaced processing circuits in a connection may be related to the number of spaced layers. The connection modes of the three-dimensional array will be further described below with reference to FIG. 6c and FIG. 6d.
  • Figure 6c shows a multi-layer, multi-row and multi-column three-dimensional array formed by connecting a plurality of processing circuits.
  • Taking the processing circuit located at the l-th layer, the r-th row and the c-th column (denoted (l, r, c)) as an example, it is located at the center of the array and is connected with the processing circuit at the previous column (l, r, c-1) and the processing circuit at the next column (l, r, c+1) in the same layer and the same row, with the processing circuit at the previous row (l, r-1, c) and the processing circuit at the next row (l, r+1, c) in the same layer and the same column, and with the processing circuit at the previous layer (l-1, r, c) and the processing circuit at the next layer (l+1, r, c) in the same row and the same column of different layers.
  • FIG. 6d shows a three-dimensional array in which the interval number of the connections between processing circuits in the row direction, the column direction and the layer direction is 1 in every case. Taking the processing circuit located at the center of the array, (l, r, c), as an example, it is connected with the processing circuits at (l, r, c-2) and (l, r, c+2), which are separated from it by one column in the same layer and the same row, with the processing circuits at (l, r-2, c) and (l, r+2, c), which are separated from it by one row in the same layer and the same column, and with the processing circuits at (l-2, r, c) and (l+2, r, c), which are separated from it by one layer in the same row and the same column. Further, the processing circuits at (l, r, c-3) and (l, r, c-1), which are in the same layer and the same row and separated by one column, are connected to each other, and the processing circuits at (l, r, c+1) and (l, r, c+3) are connected to each other. Similarly, the processing circuits at (l, r-3, c) and (l, r-1, c) in the same layer and the same column are connected to each other, and the processing circuits at (l, r+1, c) and (l, r+3, c) are connected to each other. Likewise, the processing circuits at (l-3, r, c) and (l-1, r, c) in the same row and the same column and separated by one layer are connected to each other, and the processing circuits at (l+1, r, c) and (l+3, r, c) are connected to each other.
  • The connection relationships of multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above; different loop structures formed by a plurality of processing circuits will be further exemplarily described below with reference to FIGS. 7-8.
  • FIGS. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure.
  • a plurality of processing circuits can not only be connected in a physical connection relationship, but also can be configured to be connected in a logical relationship according to the received parsed instruction.
  • the plurality of processing circuits may be configured to be connected using the logical connection relationship to form a closed loop.
  • the four adjacent processing circuits are sequentially numbered "0, 1, 2 and 3".
  • the four processing circuits are sequentially connected in a clockwise direction from processing circuit 0, and processing circuit 3 is connected with processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (referred to as "looping" for short).
  • the number of intervals between processing circuits is 0 or 2
  • the number of intervals between processing circuits 0 and 1 is 0, and the number of intervals between processing circuits 3 and 0 is 2.
  • the physical addresses of the four processing circuits in the illustrated loop may be 0-1-2-3, while their logical addresses are also 0-1-2-3. It should be noted that the connection sequence shown in FIG. 7a is only exemplary and non-limiting; those skilled in the art may also connect the four processing circuits in series in a counterclockwise direction, according to actual computing needs, to form the closed loop.
  • a plurality of processing circuits may be combined into a processing circuit group to represent one data. For example, suppose a processing circuit can handle 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into a processing circuit group, so that four 8-bit data can be connected to form a 32-bit data. Further, one processing circuit group formed by the aforementioned four 8-bit processing circuits can serve as one processing circuit 104 shown in FIG. 7b, so that higher bit-width arithmetic operations can be supported.
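  • The bit-width grouping just described can be sketched as follows: four 8-bit lanes jointly representing one 32-bit value. Treating lane 0 as the least-significant byte is an assumption of this example, not something fixed by the disclosure.

```python
def combine_lanes(lanes):
    """Combine the 8-bit values held by a group of processing circuits into one
    wider word (lane 0 is taken as the least-significant byte in this sketch)."""
    word = 0
    for i, byte in enumerate(lanes):
        word |= (byte & 0xFF) << (8 * i)
    return word

def split_word(word, n_lanes=4):
    """Split a 32-bit value back into the 8-bit pieces handled by each circuit."""
    return [(word >> (8 * i)) & 0xFF for i in range(n_lanes)]

assert combine_lanes([0x78, 0x56, 0x34, 0x12]) == 0x12345678
assert split_word(0x12345678) == [0x78, 0x56, 0x34, 0x12]
```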
  • FIG. 7b shows a layout of processing circuits similar to that of FIG. 7a, but the interval numbers of the connections between the processing circuits in FIG. 7b differ from those in FIG. 7a.
  • FIG. 7b shows four processing circuits numbered sequentially 0, 1, 2 and 3 in a clockwise direction. Starting from processing circuit 0, processing circuit 1, processing circuit 3 and processing circuit 2 are connected in sequence, and processing circuit 2 is connected back to processing circuit 0, thereby forming a closed loop in series. It can be seen from this loop that the interval number of the processing circuits shown in FIG. 7b is 0 or 1; for example, the interval between processing circuits 0 and 1 is 0, and the interval between processing circuits 1 and 3 is 1.
  • the physical addresses of the four processing circuits in the illustrated closed loop may be 0-1-2-3, while the logical addresses may be 0-1-3-2. Therefore, when data of high bit width needs to be split to be allocated to different processing circuits, the data sequence can be rearranged and allocated according to the logical addresses of the processing circuits.
  • the pre-processing circuit can rearrange the input data according to the physical addresses and logical addresses of the plurality of processing circuits, so as to meet the requirements of the data operation. Assuming that four sequentially arranged processing circuits 0 to 3 are connected as shown in FIG. 7a, since the physical and logical addresses of the connections are both 0-1-2-3, the pre-processing circuit can transmit the input data (for example, pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. In contrast, when the processing circuits are connected as shown in FIG. 7b, the pre-processing circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transmission to the corresponding processing circuits.
  • the solution of the present disclosure can ensure the correctness of the data operation sequence.
  • the post-processing circuit described in conjunction with FIG. 2 can be used to restore and adjust the order of the operation output results to bb0-bb1-bb2-bb3, so as to ensure consistency of arrangement between the input data and the output result data.
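  • The rearrangement and restoration just described can be sketched as follows for the FIG. 7b loop; logical_of_physical records which logical address sits at each physical position, and the function names are hypothetical.

```python
def rearrange_for_loop(data, logical_of_physical):
    """Pre-processing: physical slot i receives the element whose logical index
    is logical_of_physical[i]."""
    return [data[logical] for logical in logical_of_physical]

def restore_from_loop(outputs, logical_of_physical):
    """Post-processing: put each physical slot's result back at its logical position."""
    restored = [None] * len(outputs)
    for phys, logical in enumerate(logical_of_physical):
        restored[logical] = outputs[phys]
    return restored

# FIG. 7b: physical order 0-1-2-3 carries logical addresses 0-1-3-2.
assert rearrange_for_loop(["aa0", "aa1", "aa2", "aa3"], [0, 1, 3, 2]) == ["aa0", "aa1", "aa3", "aa2"]
assert restore_from_loop(["bb0", "bb1", "bb3", "bb2"], [0, 1, 3, 2]) == ["bb0", "bb1", "bb2", "bb3"]
```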
  • Figures 7c and 7d show that more processing circuits are arranged and connected in different ways, respectively, to form a closed loop.
  • the 16 processing circuits 104 numbered in the order of 0, 1 . . . 15, starting from processing circuit 0, are sequentially connected and combined every two processing circuits to form a processing circuit group.
  • processing circuit 0 is connected with processing circuit 1 to form a processing circuit group . . .
  • the processing circuit 14 is connected with the processing circuit 15 to form one processing circuit group, and finally eight processing circuit groups are formed.
  • the eight processing circuit groups can also be connected in a manner similar to the aforementioned processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of the processing circuit groups.
  • the plurality of processing circuits 104 are connected in an irregular or non-uniform manner to form a closed loop.
  • the interval number between the processing circuits can be 0 or 3 to form a closed loop; for example, processing circuit 0 can be connected with processing circuit 1 (interval number 0) and with processing circuit 4 (interval number 3), respectively.
  • the processing circuit of the present disclosure may be spaced by different numbers of processing circuits so as to be connected in a closed loop.
  • any number of intermediate intervals can also be selected for dynamic configuration, thereby connecting into a closed loop.
  • the connection of the plurality of processing circuits may be a hard connection formed by hardware, or may be a soft connection configured by software.
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various loop structures of processing circuits according to embodiments of the present disclosure.
  • multiple processing circuits may form a closed loop, and each processing circuit in the closed loop may be configured with a respective logical address.
  • the pre-processing circuit described in conjunction with FIG. 2 can be configured to split the operational data accordingly, according to the type of the operational data (such as 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the multiple pieces of sub-data obtained after the splitting to the corresponding processing circuits in the loop for subsequent operations.
  • the upper diagram of FIG. 8a shows that four processing circuits are connected to form a closed loop, and the physical addresses (which may also be referred to as physical coordinates in the context of this disclosure) of the four processing circuits in right-to-left order can be represented as 0-1-2-3.
  • the lower diagram of Figure 8a shows that the logical addresses of the four processing circuits in the aforementioned loop are represented as 0-3-1-2 in order from right to left.
  • the processing circuit with the logical address "3" shown in the lower diagram of Fig. 8a has the physical address "1" shown in the upper diagram of Fig. 8a.
  • assume that the granularity of the operational data is the lower 128 bits of the input data, such as the original sequence "15, 14, ... 2, 1, 0" in the figure (each number corresponds to 8 bits of data), and that the logical addresses of these 16 pieces of 8-bit data are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of FIG. 8a, the pre-processing circuit can use the logical addresses to encode or arrange the data differently for different data types.
  • the four groups of logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) can represent the 0th to 3rd 32-bit data, respectively.
  • for 32-bit operational data, the pre-processing circuit can transmit the 0th 32-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"), the 1st 32-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"), the 2nd 32-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"), and the 3rd 32-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • the mapping relationship between the logical address and the physical address of the final data is (15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)-> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
  • the eight groups of logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) can represent the 0th to 7th 16-bit data, respectively.
  • for 16-bit operational data, the pre-processing circuit can transfer the 0th and 4th 16-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"), the 1st and 5th 16-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"), the 2nd and 6th 16-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"), and the 3rd and 7th 16-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • accordingly, the mapping relationship between the logical addresses and the physical addresses of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
  • for 8-bit operational data, the pre-processing circuit can transmit the 0th, 4th, 8th and 12th 8-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"); the 1st, 5th, 9th and 13th 8-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"); the 2nd, 6th, 10th and 14th 8-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"); and the 3rd, 7th, 11th and 15th 8-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • accordingly, the mapping relationship between the logical addresses and the physical addresses of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
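  • The distribution pattern of FIG. 8a can be reproduced with a short sketch: distribute deals the fixed-width words round-robin over the logical addresses, and layout prints the bytes in the order used by the figure (highest physical position first, bytes within a circuit from high to low). Both helpers and the round-robin rule are inferred from the mappings listed above, not taken verbatim from the disclosure.

```python
def distribute(num_bytes, word_bytes, logical_of_physical):
    """Deal the words (word k = bytes k*word_bytes .. k*word_bytes+word_bytes-1)
    round-robin over the logical addresses, then return, for each physical
    processing circuit, the byte numbers it ends up holding."""
    n = len(logical_of_physical)
    per_logical = [[] for _ in range(n)]
    for k in range(num_bytes // word_bytes):                  # word k -> logical circuit k % n
        per_logical[k % n].extend(range(k * word_bytes, (k + 1) * word_bytes))
    return [per_logical[logical] for logical in logical_of_physical]

def layout(per_physical):
    """Figure order: highest physical slot first, bytes inside each slot high to low."""
    return [b for slot in reversed(per_physical) for b in reversed(slot)]

log_of_phys = [0, 3, 1, 2]                      # FIG. 8a: physical 0-1-2-3 carry logical 0-3-1-2
print(layout(distribute(16, 4, log_of_phys)))   # 32-bit: [11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0]
print(layout(distribute(16, 2, log_of_phys)))   # 16-bit: [13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0]
print(layout(distribute(16, 1, log_of_phys)))   # 8-bit:  [14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0]
```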
  • FIG. 8b shows that eight sequentially numbered processing circuits 0 to 7 are connected to form a closed loop, and the physical addresses of the eight processing circuits are 0-1-2-3-4-5-6-7.
  • the lower diagram of Fig. 8b shows that the logical addresses of the aforementioned eight processing circuits are 0-7-1-6-2-5-3-4.
  • the processing circuit with the physical address "6" shown in the upper diagram of Fig. 8b corresponds to the logical address "3" shown in the lower diagram of Fig. 8b.
  • the pre-processing circuit rearranges the data and then transmits it to the corresponding processing circuits. The operation is similar to that of FIG. 8a, so the technical solution described in conjunction with FIG. 8a is also applicable to FIG. 8b, and the data rearrangement process will not be repeated here.
  • the connection relationship of the processing circuits shown in FIG. 8b is similar to that shown in FIG. 8a, but the eight processing circuits shown in FIG. 8b are twice the number of processing circuits shown in FIG. 8a.
  • the granularity of the operational data described in conjunction with FIG. 8b may be twice that of the operational data described in conjunction with FIG. 8a.
  • the granularity of the operational data in this example can be the lower 256 bits of the input data, for example the original data sequence "31, 30, ..., 1, 0" in the figure, where each number corresponds to an 8-bit length.
  • the figures also show the arrangement results of the data in the looped processing circuits.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "3" are (23, 22, 7, 6), and the corresponding physical address of this processing circuit is "6".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "6" are (30, 22, 14, 6), and the corresponding physical address of this processing circuit is "3".
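  • As a cross-check, the same round-robin sketch used for FIG. 8a above (repeated here so that the snippet runs on its own) reproduces the 16-bit entry described for FIG. 8b; the helper and the dealing rule remain assumptions inferred from the listed arrangements.

```python
def distribute(num_bytes, word_bytes, logical_of_physical):
    n = len(logical_of_physical)
    per_logical = [[] for _ in range(n)]
    for k in range(num_bytes // word_bytes):                  # word k -> logical circuit k % n
        per_logical[k % n].extend(range(k * word_bytes, (k + 1) * word_bytes))
    return [per_logical[logical] for logical in logical_of_physical]

# FIG. 8b: physical 0..7 carry the logical addresses 0-7-1-6-2-5-3-4 (right to left).
log_of_phys_8b = [0, 7, 1, 6, 2, 5, 3, 4]
slot = distribute(32, 2, log_of_phys_8b)[6]    # 16-bit words over the lower 256 bits; physical "6"
print(list(reversed(slot)))                    # -> [23, 22, 7, 6], i.e. logical address "3"
```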
  • FIG. 8c shows that twenty multi-type processing circuits numbered in the order of 0, 1 . . . 19 are connected to form a closed loop (the numbers shown in the figure are the physical addresses of the processing circuits). Sixteen processing circuits numbered from 0 to 15 are first type processing circuits, and four processing circuits numbered from 16 to 19 are second type processing circuits. Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower figure of FIG. 8c.
  • FIG. 8c also shows the result of operating the aforementioned original data for different data types supported by the processing circuit.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "11" are (63, 62, 23, 22), and the corresponding physical address of this processing circuit is "9".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "17" are (77, 57, 37, 17), and the corresponding physical address of this processing circuit is "18".
  • FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure.
  • the pre-processing circuit described in the present disclosure in conjunction with FIG. 2 can also be configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on the two input data.
  • the solution of the present disclosure divides and numbers the two data to be spliced according to the minimum data unit, and then extracts different minimum data units of the data based on specified rules, so as to form different data splicing modes.
  • the minimum data unit here can simply be 1-bit data, or can be 2, 4, 8, 16 or 32 bits in length.
  • the scheme of the present disclosure can either extract alternately in units of the minimum data unit, or extract in multiples of the minimum data unit, for example alternately extracting from the two data partial data of two, one or three minimum data units at a time as a group.
  • the input data are In1 and In2, and when each square in the figure represents a minimum data unit, both input data have a bit width length of 8 minimum data units.
  • the minimum data unit may represent different numbers of bits. For example, for data with a bit width of 8 bits, the minimum data unit represents 1-bit data, and for data with a bit width of 16 bits, the minimum data unit represents 2-bit data. For another example, for data with a bit width of 32 bits, the minimum data unit represents 4-bit data.
  • the two input data In1 and In2 to be spliced are each composed of eight minimum data units sequentially numbered 1, 2, . . . , 8 from right to left.
  • Data splicing is performed according to a parity-interleaving principle, with unit numbers taken from small to large, In1 before In2, and odd-numbered units before even-numbered units.
  • when the data bit width of the operation is 8 bits, the data In1 and In2 each represent one 8-bit data, and each minimum data unit represents 1-bit data (i.e., one square represents 1-bit data).
  • in this case, the minimum data units numbered 1, 3, 5 and 7 of the data In1 are first extracted and arranged in the low-order positions.
  • the data In1 and In2 each represent a 16-bit data, and each minimum data unit at this time represents 2-bit data (ie, a square represents a 2-bit data).
  • the minimum data units numbered 1, 2, 5 and 6 of the data In1 can be extracted first and arranged in the low-order positions. Then, the minimum data units numbered 1, 2, 5 and 6 of the data In2 are arranged in sequence. Similarly, the minimum data units numbered 3, 4, 7 and 8 of the data In1 and then of the data In2 are arranged in sequence, so as to finally form new data of 32 bits (or two pieces of 16-bit data) composed of the 16 minimum data units, as shown in the second row of squares in FIG. 9b.
  • the data In1 and In2 each represent one 32-bit data, and each minimum data unit represents 4-bit data (i.e., one square represents 4-bit data).
  • according to the bit width of the data and the aforementioned interleaving and splicing principle, the minimum data units numbered 1, 2, 3 and 4 of the data In1 and then of the data In2 can be extracted first and arranged in the low-order positions. Then, the minimum data units numbered 5, 6, 7 and 8 of the data In1 and then of the data In2 are extracted and arranged in sequence, thereby splicing to form new data of 64 bits (or two pieces of 32-bit data) composed of the final 16 minimum data units.
  • Exemplary data splicing manners of the present disclosure are described above in conjunction with FIGS. 9a-9c. However, it can be understood that in some computing scenarios, data splicing does not involve the above-mentioned interleaved arrangement, but is only a simple arrangement of the two data while keeping their original data positions unchanged, as shown in FIG. 9d. It can be seen from FIG. 9d that the two data In1 and In2 are not interleaved as in FIGS. 9a-9c; instead, the last minimum data unit of the data In1 and the first minimum data unit of In2 are simply concatenated to obtain new data with an increased (e.g., doubled) bit width. In some scenarios, the solution of the present disclosure can also perform group splicing based on data attributes. For example, neuron data or weight data belonging to the same feature map can be formed into a group and then arranged to form a continuous part of the spliced data.
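  • A minimal sketch of one of the splicing modes above, following the FIG. 9b description (units 1, 2, 5 and 6 of In1, then of In2, then units 3, 4, 7 and 8 of each). The group lists passed in are taken from that description; other splicing modes would simply pass different groups, and the function itself is illustrative rather than part of the disclosed hardware.

```python
def splice(in1, in2, groups):
    """For each group of minimum-data-unit numbers (1-based), append first the
    units of in1 and then the units of in2; index 0 of the result is the
    low-order end, as in the figures."""
    out = []
    for group in groups:
        out += [in1[i - 1] for i in group]
        out += [in2[i - 1] for i in group]
    return out

in1 = [f"a{i}" for i in range(1, 9)]   # the eight minimum data units of In1, low to high
in2 = [f"b{i}" for i in range(1, 9)]   # the eight minimum data units of In2, low to high
print(splice(in1, in2, [(1, 2, 5, 6), (3, 4, 7, 8)]))
# -> ['a1', 'a2', 'a5', 'a6', 'b1', 'b2', 'b5', 'b6', 'a3', 'a4', 'a7', 'a8', 'b3', 'b4', 'b7', 'b8']
```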
  • FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by post-processing circuits according to embodiments of the present disclosure.
  • the compressing operation may include filtering the data with a mask or compressing by comparing a given threshold with the size of the data.
  • for the compression operation, the data can be divided and numbered in minimum data units as described previously. Similar to the description in connection with FIGS. 9a-9d, the minimum data unit may be, for example, 1-bit data, or 2, 4, 8, 16 or 32 bits in length. Exemplary descriptions of different data compression modes are given below in conjunction with FIGS. 10a to 10c.
  • the original data consists of eight squares (ie, eight minimum data units) sequentially numbered 1, 2..., 8 from right to left, assuming that each minimum data unit can represent 1 bit data.
  • the post-processing circuit may filter the original data by using the mask to perform the data compression operation.
  • the bit width of the mask corresponds to the number of minimum data units of the original data. For example, if the aforementioned original data has 8 minimum data units, the bit width of the mask is 8 bits; the minimum data unit numbered 1 corresponds to the lowest bit of the mask, the minimum data unit numbered 2 corresponds to the next-lowest bit, and so on, with the minimum data unit numbered 8 corresponding to the most significant bit of the mask.
  • the compression principle may be set to extract the smallest data unit in the original data corresponding to the data bit whose mask is "1".
  • the numbers of the smallest data units corresponding to the mask value "1" are 1, 2, 5, and 8.
  • the minimum data units numbered 1, 2, 5 and 8 can be extracted and arranged in order from low to high to form new compressed data, as shown in the second row of Figure 10a.
  • Fig. 10b shows the original data similar to Fig. 10a, and it can be seen from the second row of Fig. 10b that the data sequence passed through the post-processing circuit maintains the original data arrangement order and content. It will thus be appreciated that the data compression of the present disclosure may also include a disabled mode or a non-compressed mode so that no compression operation is performed when the data passes through the post-processing circuit.
  • the original data consists of eight squares arranged in sequence; the number above each square indicates its number, in order 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit is 8-bit data. Further, the number in each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", and the corresponding 8-bit data is "00001000".
  • the compression principle can be set to extract all the smallest data units in the original data that are greater than or equal to the threshold "8".
  • the smallest data units numbered 1, 4, 7 and 8 can be extracted. Then, arrange all the extracted minimum data units in descending order of numbers to obtain the final data result, as shown in the second row in Figure 10c.
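  • Both compression modes can be sketched in a few lines. The element values below are made up so that, as in the FIG. 10a and FIG. 10c descriptions, the units numbered 1, 2, 5 and 8 survive the mask and the units numbered 1, 4, 7 and 8 survive the threshold; keeping the survivors in their original low-to-high order is an assumption of this sketch.

```python
def mask_compress(units, mask):
    """Keep the minimum data units whose corresponding mask bit is 1;
    bit 0 of the mask gates the unit numbered 1 (the lowest unit)."""
    return [u for i, u in enumerate(units) if (mask >> i) & 1]

def threshold_compress(units, threshold):
    """Keep the minimum data units whose value is >= the threshold."""
    return [u for u in units if u >= threshold]

units = [8, 3, 1, 12, 2, 5, 9, 15]          # units numbered 1..8, low to high (illustrative values)
print(mask_compress(units, 0b10010011))      # mask selects units 1, 2, 5 and 8 -> [8, 3, 2, 15]
print(threshold_compress(units, 8))          # units with value >= 8            -> [8, 12, 9, 15]
```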
  • FIG. 11 is a simplified flow diagram illustrating a method 1100 of performing computational operations using a computing device, which may have the hardware architecture described in connection with FIGS. 1-4, according to an embodiment of the present disclosure.
  • the method 1100 may utilize the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits.
  • the control circuit may determine, according to the instruction identification information in the instruction, one or more processing circuits that are to perform the operation, and send the parsed instruction to those one or more of the plurality of processing circuits so that they perform the corresponding operation specified by the parsed instruction.
  • in the process of parsing the instruction, the control circuit may perform a decoding operation on the instruction and send the parsed instruction to one or more of the plurality of processing circuits according to the decoding result.
  • for example, the control circuit can send the parsed instruction to processing circuits with low occupancy or in an idle state according to the operating states of the multiple processing circuits.
  • the parsed instruction may also be a parsed instruction that has not yet been decoded by the control circuit.
  • in this case, the one or more processing circuits may include corresponding decoding circuits to decode the received parsed instruction, for example to generate multiple micro-instructions, so that the one or more processing circuits can perform subsequent operations according to the micro-instructions.
  • at step 1120, the method 1100 may utilize the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  • the plurality of processing circuits may be configured to receive and execute the parsed instructions in a single instruction multithreading ("SIMT") fashion.
  • the plurality of processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the processing circuits connected in series through such connections may form one or more closed loops.
  • a plurality of processing circuits may determine whether to execute the operation specified by the parsed instruction according to the received information (eg, predicate information) in the parsed instruction.
  • FIG. 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of the present disclosure.
  • the combined processing device 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208.
  • one or more computing devices 1210 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 11 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of artificial intelligence processor cores, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • processors may include, but are not limited to, digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including, but not limited to, data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1302 shown in FIG. 13 ).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 12 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1306 shown in FIG. 13 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.
  • in some application scenarios, other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may also be integrated on the chip.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 13 .
  • FIG. 13 is a schematic structural diagram illustrating a board 1300 according to an embodiment of the present disclosure.
  • the board includes a storage device 1304 for storing data, which includes one or more storage units 1310 .
  • the storage device can be connected and data transferred with the control device 1308 and the chip 1302 described above through, for example, a bus.
  • the board also includes an external interface device 1306, which is configured to relay or transfer data between the chip (or a chip in a chip package structure) and an external device 1312 (such as a server or a computer).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or apparatus may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound machines and/or electrocardiographs.
  • the electronic equipment or device of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and medical care. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and the terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (e.g., a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or an edge device (such as a smartphone or a camera).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, depending on the solution, the present disclosure also focuses its description on some embodiments. In view of this, those skilled in the art can understand that the parts not described in detail in a certain embodiment of the present disclosure may also be found in the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
  • Clause 1 A computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain and parse an instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations in accordance with the received parsed instructions.
  • Clause 2 The computing device of clause 1, wherein the control circuit is configured to: send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
  • Clause 3 The computing device of clause 1, wherein the control circuit is configured to: send the parsed instruction to one or more of the plurality of processing circuits according to the result of the decoding and the operating states of the plurality of processing circuits.
  • Clause 4 The computing device of clause 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits to perform different types of data processing.
  • Clause 5 The computing device of clause 1, wherein the plurality of processing circuits are divided into first type processing circuits and second type processing circuits, wherein the first type processing circuits are adapted to perform at least one of an arithmetic operation and a logical operation, and the second type processing circuits are adapted to perform at least one of a comparison operation and a table look-up operation.
  • Clause 6 The computing device of clause 1, wherein the multidimensional array is a two-dimensional array, and the processing circuits located in the two-dimensional array are connected, in at least one of the row, column or diagonal directions thereof, with the remaining one or more of the processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern.
  • Clause 7 The computing device of clause 6, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
  • Clause 8 The computing device of clause 1, wherein the multidimensional array is a three-dimensional array composed of a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the processing circuits arranged in row, column and diagonal directions, and wherein: the processing circuits located in the three-dimensional array are connected, in at least one of the row, column, diagonal and layer directions, with the remaining one or more processing circuits in the same row, the same column, the same diagonal or a different layer in a predetermined three-dimensional interval pattern.
  • Clause 9 The computing device of clause 8, wherein the predetermined three-dimensional spacing pattern is associated with a number of spacings and layers of spacing between processing circuits to be connected.
  • Clause 10 The computing device of any of clauses 6-9, wherein the plurality of processing circuits are configured to be connected by logical connections to form one or more closed loops.
  • Clause 11 The computing device of clause 10, wherein the plurality of processing circuits are configured to determine from the parsed instructions whether to connect by logical connections to form one or more closed loops.
  • Clause 12 The computing device of clause 1, wherein a plurality of the processing circuits are configured to form at least one group of processing circuits to process data according to a bit width of the received data.
  • Clause 13 The computing device of clause 12, wherein when a plurality of the processing circuit groups are formed to process data, the plurality of processing circuit groups are connected by logical connections according to the parsed instructions to form one or more closed loops.
  • Clause 14 The computing device of clause 1, wherein each of the processing circuits comprises: a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and a storage circuit including a data storage circuit, wherein the data storage circuit is configured to store at least one of operation data and intermediate operation results of the processing circuit.
  • Clause 15 The computing device of clause 14, wherein the storage circuit further comprises a predicate storage circuit, wherein the predicate storage circuit is configured to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction.
  • Clause 16 The computing device of clause 15, wherein the predicate storage circuit is further configured to: update the predicate information according to the operation result of each of the processing circuits.
  • Clause 17 The computing device of clause 15, wherein each of the processing circuits is configured to: determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
  • Clause 18 The computing device of clause 1, wherein the processing circuit further comprises an arithmetic operation circuit configured to perform arithmetic operation operations.
  • Clause 19 The computing device of clause 1, further comprising: a data processing circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on the operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on the operation result after the processing circuit performs the operation.
  • Clause 20 The computing device of clause 19, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to divide the operation data according to the operation type of the data and the logical addresses, and to transmit the multiple pieces of sub-data obtained after the division respectively to the corresponding processing circuits in the loop for operation.
  • Clause 21 The computing device of Clause 19, wherein the pre-processing circuit is further configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction to perform a splicing operation on the two pieces of input data.
  • Clause 22 The computing device of clause 21, wherein the post-processing circuit is further configured to perform a compression operation on the data, the compression operation comprising filtering the data with a mask or filtering by comparing a given threshold with the data size.
  • Clause 23 The computing device of clause 1, further comprising:
  • a main storage circuit including at least one of a main storage module and a main cache module, wherein the main storage module is configured to store the data used for performing operations in the processing circuits and the operation results after the operations are performed, and the main cache module is configured to cache the intermediate operation results after the operations are performed in the processing circuits.
  • Clause 24 The computing device of any of clauses 1-9 or 11-23, wherein the plurality of processing circuits are configured to receive and execute the parsed instructions in a SIMT manner.
  • Clause 27 A method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: utilizing the control circuit to obtain and parse an instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and utilizing the one or more processing circuits to perform multi-threaded operations in accordance with the parsed instruction.
  • Clause 28 The method of clause 27, wherein in parsing the instruction, the method utilizes the control circuit to perform:
  • the parsed instruction is sent to one or more of the plurality of processing circuits according to the instruction identification information.
  • Clause 29 The method of clause 27, wherein in parsing the instruction, the method utilizes the control circuit to perform:
  • the parsed instruction is sent to one or more of the plurality of processing circuits according to the result of the decoding and the operating state of the plurality of processing circuits.
  • Clause 30 The method of clause 27, comprising dividing the plurality of processing circuits into multiple types of processing circuits to perform different types of data processing.
  • Clause 31 The method of clause 30, wherein dividing the plurality of processing circuits into a plurality of types of processing circuits comprises dividing the plurality of processing circuits into a first type of processing circuits and a second type of processing circuits, wherein the first type of processing circuits are adapted to perform at least one of an arithmetic operation and a logical operation, and the second type of processing circuits are adapted to perform at least one of a comparison operation and a table look-up operation.
  • Clause 32 The method of clause 27, wherein the multidimensional array is a two-dimensional array, and the method comprises connecting the processing circuits located in the two-dimensional array, in at least one of the row, column or diagonal directions thereof, with the remaining one or more of the processing circuits in the same row, column or diagonal in a predetermined two-dimensional interval pattern.
  • Clause 33 The method of clause 32, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
  • Clause 35 The method of clause 34, wherein the predetermined three-dimensional spacing pattern is associated with a number of spacings and layers of spacing between processing circuits to be connected.
  • Clause 36 The method of any of clauses 32-35, comprising connecting the plurality of processing circuits through logical connections to form one or more closed loops.
  • Clause 37 The method of clause 36, wherein the method comprises determining from the parsed instructions whether to connect the plurality of processing circuits by logical connections to form one or more closed loops.
  • Clause 38 The method of clause 27, wherein a plurality of said processing circuits are configured to form at least one group of processing circuits to process data according to a bit width of the received data.
  • Clause 39 The method of clause 38, wherein when a plurality of the processing circuit groups are formed to process data, the method comprises connecting the plurality of processing circuit groups by logical connections according to the parsed instructions , to form one or more closed loops.
  • Clause 40 The method of clause 27, wherein each of the processing circuits includes a logic operation circuit and a storage circuit, wherein the storage circuit includes a data storage circuit, and wherein the method comprises, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, and using the data storage circuit to store at least one of operation data and intermediate operation results of the processing circuit.
  • Clause 41 The method of clause 40, wherein the storage circuit further comprises a predicate storage circuit, wherein the method comprises using the predicate storage circuit to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction.
  • Clause 42 The method of clause 41, further comprising utilizing the predicate storage circuit to perform the following steps:
  • the predicate information is updated according to the operation result of each of the processing circuits.
  • Clause 43 The method of clause 41, wherein whether the processing circuit executes the parsed instruction is determined according to the predicate information.
  • Clause 44 The method of clause 27, wherein the processing circuit further comprises an arithmetic operation circuit, the method comprising utilizing the arithmetic operation circuit to perform an arithmetic operation operation.
  • Clause 45 The method of clause 34, wherein the computing device further comprises a data processing circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method comprises, before the processing circuit performs an operation, The preprocessing circuit is used to perform a preprocessing operation on the operation data, and after the processing circuit performs the operation, the postprocessing circuit is used to perform a postprocessing operation on the operation result.
  • Clause 46 The method of clause 45, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, the method comprising utilizing the pre-processing circuit to divide the operation data according to the operation type of the data and the logical addresses, and to transmit the multiple pieces of sub-data obtained after the division respectively to the corresponding processing circuits in the loop for operation.
  • Clause 47 The method of clause 45, wherein the method further comprises utilizing the pre-processing circuit to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, to perform a splicing operation on the two pieces of input data.
  • Clause 48 The method of clause 47, wherein the method further comprises using the post-processing circuit to perform a compression operation on the data, the compression operation comprising filtering the data using a mask or filtering by comparing a given threshold with the data size.
  • Clause 49 The method of clause 27, wherein the computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the method comprises using the main storage module to store the data used for performing operations in the processing circuits and the operation results after the operations are performed, and using the main cache module to cache the intermediate operation results after the operations are performed in the processing circuits.
  • Clause 50 The method of any of clauses 27-49, wherein the method comprises utilizing the plurality of processing circuits to receive and execute the parsed instructions in a SIMT manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

A computing apparatus, an integrated circuit chip, a board, and a method for executing arithmetic operations using the described computing apparatus. The computing apparatus may be included in a combined processing apparatus, and the combined processing apparatus may further comprise a universal interconnecting interface and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation designated by a user. The combined processing apparatus may further comprise a storage apparatus, and the storage apparatus is respectively connected to the computing apparatus and the other processing apparatuses and is used for storing data of the computing apparatus and the other processing apparatuses.

Description

Computing device, integrated circuit chip, board and computing method
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application No. 2020106181203, filed on June 30, 2020 and entitled "Computing Device, Integrated Circuit Chip, Board Card, and Computing Method", which is hereby incorporated by reference in its entirety.
Technical Field
This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, and a method of performing computing operations using the aforementioned computing device.
Background
Existing artificial intelligence operations often include a large number of data operations, such as convolution operations and image processing. As the amount of data increases, the amount of computation and storage involved in data operations such as matrix operations increases sharply with the size of the data. In existing computing methods, a general-purpose processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used for computing. However, general-purpose processors often have high power consumption overhead due to their general-purpose features and high device redundancy, which limits their performance.
In addition, existing operation processing circuits usually adopt a fixed hardware architecture. When the data scale expands or the data format changes, such a circuit may not only be unable to support a certain type of operation, but its computing performance may also be severely limited during operation, or it may even become inoperable.
Summary of the Invention
In order to at least solve the above-mentioned defects in the prior art, the present disclosure provides a solution that supports multiple types of operations, improves operation efficiency, and saves operation cost and overhead. Specifically, the present disclosure provides the aforementioned solution in the following aspects.
In a first aspect, the present disclosure provides a computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain an instruction and parse the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations in accordance with the received parsed instruction.
In a second aspect, the present disclosure provides an integrated circuit chip comprising the computing device of the foregoing and later-described embodiments.
In a third aspect, the present disclosure provides a board including the aforementioned integrated circuit chip.
In a fourth aspect, the present disclosure provides a method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: utilizing the control circuit to obtain an instruction and parse the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and utilizing the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
By using the computing device, integrated circuit chip, board and method disclosed herein, it is possible to overcome the operational limitations of a fixed hardware architecture, to improve the efficiency of data processing and computation in various data processing fields, including the field of artificial intelligence, and to reduce the power consumption overhead and cost of data operations.
Description of the Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals refer to like or corresponding parts, wherein:
FIG. 1 is an overall architecture diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 2 is an exemplary specific architecture diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 3 is an example structural diagram illustrating an array of a single type of processing circuits of a computing device according to an embodiment of the present disclosure;
FIG. 4 is an example structural diagram illustrating an array of multiple types of processing circuits of a computing device according to an embodiment of the present disclosure;
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure;
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure;
FIGS. 7a, 7b, 7c and 7d are schematic diagrams illustrating various loop structures of processing circuits according to embodiments of the present disclosure;
FIGS. 8a, 8b and 8c are schematic diagrams illustrating further loop structures of processing circuits according to embodiments of the present disclosure;
FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure;
FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by a post-processing circuit according to an embodiment of the present disclosure;
FIG. 11 is a simplified flowchart illustrating a method of performing computing operations using a computing device according to an embodiment of the present disclosure;
FIG. 12 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 13 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.
The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 is an overall architecture diagram illustrating a computing device 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the computing device 100 of the present disclosure may include a control circuit 102 and a plurality of processing circuits 104. In performing data processing, the control circuit may be configured to obtain an instruction and parse the instruction, and may send the parsed instruction to one or more of the plurality of processing circuits.
According to the solution of the present disclosure, the obtained instruction may include one or more opcodes, and each opcode may represent one or more specific operations to be performed by one or more processing circuits. Each opcode may be represented in any suitable form. For example, an opcode may be represented by an English abbreviation such as "ADD" or "MUL" to express that the operation to be performed is an "addition" or "multiplication" operation. Alternatively, an opcode may also be represented by an English abbreviation such as "AM" whose specific operation cannot be determined literally. Depending on the application scenario, the opcode may include or involve different types of operations, for example arithmetic operations such as addition or multiplication, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing types of operations. Further, in the present disclosure, each opcode may correspond to one or more micro-instructions obtained in the process of parsing the instruction. Thus, the parsed instruction of the present disclosure may include one or more micro-instructions corresponding to an opcode in the instruction to indicate one or more specific operations to be performed by the processing circuit.
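Purely as an illustrative assumption (the mnemonics and the micro-instruction split below are examples chosen for this sketch and are not defined by the disclosure), the decomposition of an opcode into micro-instructions could be modelled as:

```python
# Hypothetical mapping from opcodes to the micro-instructions a parsed instruction may carry.
MICRO_OPS = {
    "ADD": ["load_operands", "add", "store_result"],
    "MUL": ["load_operands", "multiply", "store_result"],
    "AM":  ["load_operands", "add", "multiply", "store_result"],  # mnemonic whose meaning is not literal
}

def parse_opcode(opcode: str):
    """Return the micro-instructions corresponding to an opcode of the parsed instruction."""
    return MICRO_OPS.get(opcode, [])

print(parse_opcode("ADD"))   # ['load_operands', 'add', 'store_result']
```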
In one embodiment, in the process of parsing the instruction, the control circuit 102 may be configured to obtain instruction identification information in the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information, wherein one or more processing circuits are identified in the instruction identification information. Further, depending on the application scenario, the parsed instruction here may be an instruction decoded by the control circuit, or a parsed instruction that has not been decoded by the control circuit. When the parsed instruction is a parsed instruction that has not been decoded by the control circuit, the processing circuit may include a corresponding decoding circuit to decode the parsed instruction, for example, to obtain a plurality of micro-instructions.
In another embodiment, in the process of parsing the instruction, the control circuit may be configured to decode the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits according to the result of the decoding and the operating states of the plurality of processing circuits. In this embodiment, the multiple processing circuits may all support non-specific operations of the same type. Therefore, in order to improve the utilization rate and operation efficiency of the processing circuits, the parsed instruction may be sent to processing circuits whose occupancy is low or which are in an idle state.
In one or more embodiments, the plurality of processing circuits 104 may be configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction. In one embodiment, the plurality of processing circuits may be configured to receive and execute the parsed instruction in a single instruction, multiple threads ("SIMT") fashion. In another embodiment, when the multiple processing circuits are configured to be connected in a multi-dimensional array structure, the multi-dimensional array may include a two-dimensional array and/or a three-dimensional array (as shown in FIGS. 5 and 6). Further, each processing circuit in the aforementioned one-dimensional or multi-dimensional array can be connected, within a certain range, to other processing circuits in a specified direction and with a predetermined interval pattern. In addition, multiple processing circuits may be connected in series through logical connections to form one or more closed loops (as shown in FIGS. 7 and 8).
In different application scenarios, the connection between the multiple processing circuits may be a hard-wired connection implemented by the hardware structure. Additionally or alternatively, the connection between the multiple processing circuits may also be a logical connection configured according to parsed instructions, such as micro-instructions. Through the above hard-wired or logical connections, various topologies of the processing circuit array can be formed so as to be suitable for performing the corresponding data processing operations.
FIG. 2 is an exemplary specific architecture diagram illustrating a computing device 200 according to an embodiment of the present disclosure. As can be seen from FIG. 2, the computing device 200 not only includes the control circuit 102 and the plurality of processing circuits 104 of the computing device 100 in FIG. 1, but also further shows a plurality of circuits included in each processing circuit, as well as a number of additional components. Since the functions of the control circuit and the processing circuits have been described in detail above with reference to FIG. 1, they will not be repeated below.
As shown in FIG. 2, the processing circuit 104 may include a logic operation circuit 1041, which may be configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation, for example, to perform logical operations such as AND, OR and NOT, shift operations or comparison operations on the received data. In addition to performing the necessary logical operations, the processing circuit 104 may also include an arithmetic operation circuit 1043, which may be configured to perform arithmetic operations, such as linear operations like addition, subtraction or multiplication.
In one embodiment, the processing circuit 104 may also include a storage circuit 1042, which includes a data storage circuit and/or a predicate storage circuit, wherein the data storage circuit may be configured to store at least one of the operation data (for example, pixels) of the processing circuit and intermediate operation results. Further, the predicate storage circuit may be configured to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction. In a specific storage application, the storage circuit 1042 may be implemented by a memory such as a register or a static random access memory ("SRAM") according to actual needs.
In an application scenario, the predicate storage circuit may include a number a of 1-bit registers for storing predicate information. Further, the numbers of the a 1-bit registers can be represented by a b-bit binary number, where b >= log2(a). For example, the predicate storage circuit in a processing circuit may include 32 1-bit registers sequentially numbered from 00000 to 11111. Thus, the processing circuit can read the predicate information in the register numbered "00101" according to the register number "00101" specified in the received parsed instruction.
In one embodiment, the predicate storage circuit may be configured to update the predicate information according to the parsed instruction. For example, the predicate information may be updated directly according to configuration information in the parsed instruction, or the configuration information may be obtained according to a configuration information storage address provided in the parsed instruction, so as to update the predicate information. In the course of the processing circuit performing an operation, the predicate storage circuit may also update the predicate information according to the comparison result of each processing circuit (which, in the context of the present disclosure, is a form of operation result). For example, the predicate information may be updated by comparing the input data received by the processing circuit with the data stored in its data storage circuit. When the input data is greater than the stored data, the predicate information of the processing circuit is set to 1. Conversely, when the input data is smaller than the stored data, the predicate information is set to 0, or its original value is kept unchanged.
Before performing an operation, each processing circuit may determine, according to the information in the parsed instruction, whether it should execute the operation of the parsed instruction. Further, each of the processing circuits may be configured to obtain the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit number in the parsed instruction, and to determine, according to the predicate information, whether the processing circuit executes the parsed instruction. For example, when the value of the predicate information read by the processing circuit according to the predicate storage circuit number specified in the parsed instruction is 1, the processing circuit executes the parsed instruction; for example, the processing circuit may read the data pointed to by the instruction and store the read data into the data storage circuit of the processing circuit. Conversely, when the value of the predicate information read by the processing circuit according to the predicate storage circuit number specified in the parsed instruction is 0, the processing circuit does not execute the parsed instruction.
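A minimal sketch of the predicate storage behaviour described in the preceding paragraphs (the class and method names are illustrative assumptions; only the register numbering and the update-by-comparison rule are taken from the text above):

```python
class PredicateStorage:
    """A number a of 1-bit predicate registers, addressed by a b-bit number with b >= log2(a)."""
    def __init__(self, num_regs: int = 32):
        self.regs = [0] * num_regs

    def read(self, reg_number: str) -> int:
        return self.regs[int(reg_number, 2)]            # e.g. "00101" -> register 5

    def write(self, reg_number: str, value: int) -> None:
        self.regs[int(reg_number, 2)] = value

    def update_by_comparison(self, reg_number: str, input_data: int, stored_data: int) -> None:
        # Set the predicate to 1 when the input data is greater than the stored data;
        # otherwise set it to 0 (the text also allows leaving it unchanged).
        self.write(reg_number, 1 if input_data > stored_data else 0)

def maybe_execute(pred: PredicateStorage, reg_number: str, action) -> bool:
    """Execute `action` only when the predicate register named by the parsed instruction holds 1."""
    if pred.read(reg_number) == 1:
        action()
        return True
    return False

pred = PredicateStorage()
pred.update_by_comparison("00101", input_data=9, stored_data=4)   # 9 > 4, so the predicate becomes 1
maybe_execute(pred, "00101", lambda: print("processing circuit executes the parsed instruction"))
```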
In one embodiment, the computing device 200 of the present disclosure may also include a data processing circuit 106, which may include at least one of a pre-processing circuit 1061 and a post-processing circuit 1062. The pre-processing circuit 1061 may be configured to perform a pre-processing operation (described later in conjunction with FIG. 7b) on the operation data before the processing circuit performs an operation, such as a data splicing or data placement operation. The post-processing circuit 1062 may be configured to perform a post-processing operation on the operation result after the processing circuit performs the operation, such as a data restoration or data compression operation.
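The concrete splicing and compression modes are those of FIGS. 9 and 10; purely as a hedged illustration (the two modes below are assumptions made for this sketch, not the modes defined by the disclosure), a pre-processing step that selects a splicing mode for two pieces of input data could look like:

```python
def splice(a, b, mode):
    """Combine two pieces of input data according to a splicing mode selected by the parsed instruction."""
    if mode == "concat":              # place the elements of b after those of a
        return list(a) + list(b)
    if mode == "interleave":          # alternate elements taken from a and b
        out = []
        for x, y in zip(a, b):
            out.extend([x, y])
        return out
    raise ValueError(f"unknown splicing mode: {mode}")

print(splice([1, 2, 3], [4, 5, 6], "interleave"))   # [1, 4, 2, 5, 3, 6]
```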
In order to realize data transfer and storage, the computing device 200 may further include a main storage circuit 108, which can receive and store data from the control circuit as input data of the processing circuits, and can also be used to transfer and store data between the multiple processing circuits. In some application scenarios, the main storage circuit 108 may be further divided into at least one of a main storage module 1081 and a main cache module 1082 according to the storage method or the characteristics of the stored data. The main storage module 1081 may be configured to store the data (for example, input pixels) on which operations are to be performed in the processing circuits and the operation results (for example, output pixels) after the operations are performed. The main cache module 1082 may be configured to cache the intermediate operation results after operations are performed in the plurality of processing circuits. In some application scenarios, the main storage circuit not only provides internal storage but also supports data interaction with storage devices outside the computing device of the present disclosure; for example, it can exchange data with an external storage device through direct memory access ("DMA").
FIG. 3 is an example structural diagram illustrating an array of a single type of processing circuits of a computing device according to an embodiment of the present disclosure. As shown in FIG. 3, the computing device not only includes the control circuit 102, the main storage circuit 108, the data processing circuit 106 and a plurality of processing circuits 104 of the same type shown in FIG. 2, but further shows that a plurality of processing circuits of the same type may be arranged through physical connections to form a two-dimensional array. Since the functions of the control circuit, the main storage circuit, the data processing circuit and the processing circuits have been described in detail above with reference to FIG. 2, they will not be repeated here.
As previously mentioned, the plurality of processing circuits of the present disclosure may be divided according to type so as to perform different types of data processing operations. For example, the plurality of processing circuits may be divided into first type processing circuits and second type processing circuits (as shown in FIG. 4). In an application scenario, the first type processing circuits may be adapted to perform at least one of arithmetic operations and logical operations, while the second type processing circuits may be adapted to perform at least one of comparison operations and table lookup operations.
FIG. 4 is an example structural diagram illustrating an array of multiple types of processing circuits of a computing device according to an embodiment of the present disclosure. The computing device shown in FIG. 4 includes a control circuit 102, a main storage circuit 108 and a plurality of processing circuits 104 of different types. Optionally, the computing device may also include a data processing circuit 106 as shown in FIGS. 2 and 3. In view of this, the computing device architecture shown in FIG. 4 is similar to the architectures shown in FIGS. 2 and 3, so the technical details of the computing device 200 described in conjunction with FIGS. 2 and 3 also apply to the computing device shown in FIG. 4.
As can be seen from FIG. 4, the plurality of processing circuits of the present disclosure may include, for example, a plurality of first type processing circuits and a plurality of second type processing circuits (in the figure, processing circuits with different background colors are of different types). The plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. It can be understood that the arrangement of the two types of processing circuits shown in FIG. 4 is merely exemplary and not limiting, and those skilled in the art may conceive of other arrangements based on the teachings of the present disclosure. For example, a plurality of first type processing circuits may be arranged on the left and right sides of the array, and a plurality of second type processing circuits may be arranged in the middle area of the array. For another example, a plurality of first type processing circuits may be arranged in the middle area of the array, and a plurality of second type processing circuits may be arranged around the array. For yet another example, a plurality of first type and second type processing circuits may also be interspersed at intervals in the array. Depending on the computing scenario, the types of processing circuits of the present disclosure are not limited to the two shown in the figure; there may be more types of processing circuits to implement different types of operations.
As shown in the figure, the two-dimensional array contains M rows and N columns (denoted M*N) of first-type processing circuits (the processing circuits 104 with a light background in the figure), where M and N are positive integers greater than 0. The first-type processing circuits may be used to perform arithmetic and logical operations, which may include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, nonlinear operations such as AND, OR and NOT, or any combination of the foregoing. Further, on the left and right sides of the periphery of the M*N array of first-type processing circuits there are two columns each, for a total of (M*2+M*2) second-type processing circuits, and on the lower side of the periphery there are two rows, for a total of (N*2+8) second-type processing circuits; that is, the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits (the processing circuits 104 with a dark background in the figure). In one embodiment, the second-type processing circuits may be used to perform nonlinear operations such as comparison operations, table lookup operations or shift operations on the received data.
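By way of illustration only, the following short Python sketch recomputes the circuit counts stated above from M and N; the function name and the layout assumptions (two extra columns of second-type circuits on each side, and two extra rows spanning the widened array of N+4 columns) are taken from this paragraph and are not part of any claimed implementation.

```python
def second_type_count(m: int, n: int) -> int:
    """Count the dark-background (second-type) circuits surrounding an
    M*N array of first-type circuits: two columns of M circuits on each
    side, plus two rows spanning the widened array of N + 4 columns."""
    side_columns = 2 * m + 2 * m      # M*2 + M*2
    bottom_rows = 2 * (n + 4)         # N*2 + 8
    return side_columns + bottom_rows

# Example: a 4 x 4 array of first-type circuits
assert second_type_count(4, 4) == 4 * 2 + 4 * 2 + 4 * 2 + 8   # 32
```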
In some application scenarios, the storage circuits used by the first-type and second-type processing circuits may have different storage scales and storage schemes. For example, the predicate storage circuit in a first-type processing circuit may store predicate information in a plurality of numbered registers. Further, the first-type processing circuit may access the predicate information in the correspondingly numbered register according to the register number specified in the received parsed instruction. As another example, a second-type processing circuit may store the predicate information in a static random access memory ("SRAM"). Specifically, the second-type processing circuit may determine the storage address of the predicate information in the SRAM according to the offset of the location of the predicate information specified in the received parsed instruction, and may perform predetermined read or write operations on the predicate information at that storage address.
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. The plurality of processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology. When a plurality of processing circuits are connected in a multi-dimensional array, the multi-dimensional array may be a two-dimensional array, and a processing circuit located in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with the remaining one or more processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern. The predetermined two-dimensional interval pattern may be associated with the number of processing circuits spaced apart in the connection. FIGS. 5a to 5c exemplarily show various forms of two-dimensional array topologies between a plurality of processing circuits.
As shown in FIG. 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, one processing circuit serves as the center of the two-dimensional array, and one processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array of three rows and three columns. Further, since the processing circuit at the center of the two-dimensional array is directly connected to the adjacent processing circuits in the preceding and following columns of the same row and to the adjacent processing circuits in the preceding and following rows of the same column, the number of spaced processing circuits (the "interval number" for short) is 0.
As shown in FIG. 5b, processing circuits in four rows and four columns may be connected to form a two-dimensional torus array, in which each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and in the preceding and following columns, that is, with an interval number of 0 for adjacent connections. Further, the first processing circuit of each row or column of the two-dimensional torus array is also connected to the last processing circuit of that row or column, and the interval number between these end-to-end connected processing circuits of each row or column is 2.
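As a purely illustrative aid, the sketch below enumerates the connections of a 4x4 two-dimensional torus such as the one in FIG. 5b; the coordinate convention and the helper name torus_neighbors are assumptions introduced for this example and do not describe the actual hardware wiring.

```python
def torus_neighbors(rows: int, cols: int):
    """Return a dict mapping (r, c) to its four torus neighbors.

    Adjacent circuits are directly connected (interval number 0); the
    wrap-around edge links the first and last circuit of every row and
    column, which in a 4x4 array skips two circuits (interval number 2).
    """
    links = {}
    for r in range(rows):
        for c in range(cols):
            links[(r, c)] = [
                ((r - 1) % rows, c),   # previous row
                ((r + 1) % rows, c),   # next row
                (r, (c - 1) % cols),   # previous column
                (r, (c + 1) % cols),   # next column
            ]
    return links

# In the 4x4 torus, circuit (0, 0) is linked to (3, 0) and (0, 3) by wrap-around.
assert (3, 0) in torus_neighbors(4, 4)[(0, 0)]
```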
As shown in FIG. 5c, processing circuits in four rows and four columns may also be connected to form a two-dimensional array in which the interval number between adjacent processing circuits is 0 and the interval number between non-adjacent processing circuits is 1. Specifically, in this two-dimensional array, adjacent processing circuits in the same row or column are directly connected, that is, with an interval number of 0, while non-adjacent processing circuits in the same row or column are connected with an interval number of 1. It can be seen that, when a plurality of processing circuits are connected to form a two-dimensional array, the processing circuits in the same row or column may be connected with different interval numbers, as shown in FIGS. 5b and 5c. Similarly, in some scenarios, processing circuits in the diagonal direction may also be connected with different interval numbers.
As shown in FIG. 5d, four two-dimensional torus arrays such as the one shown in FIG. 5b may be arranged as four layers at predetermined intervals and connected to form a three-dimensional torus array. On the basis of the two-dimensional torus array, the three-dimensional torus array uses an interval pattern similar to that between rows and between columns for its inter-layer connections. For example, the processing circuits in the same row and column of adjacent layers are first connected directly, that is, with an interval number of 0. Then the processing circuits in the same row and column of the first layer and the last layer are connected, that is, with an interval number of 2. A three-dimensional torus array of four layers, four rows and four columns is thus finally formed.
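Extending the previous sketch under the same assumptions, the following illustrative helper adds the layer direction with the same wrap-around rule, matching the four-layer torus just described; it is an illustration only, not the disclosed hardware wiring.

```python
def torus3d_neighbors(layers: int, rows: int, cols: int):
    """Return a dict mapping (l, r, c) to its six torus neighbors.

    Layers are linked with the same wrap-around rule as rows and
    columns: adjacent layers are directly connected, and the first and
    last layers are connected end to end.
    """
    links = {}
    for l in range(layers):
        for r in range(rows):
            for c in range(cols):
                links[(l, r, c)] = [
                    ((l - 1) % layers, r, c), ((l + 1) % layers, r, c),   # layer direction
                    (l, (r - 1) % rows, c), (l, (r + 1) % rows, c),       # row direction
                    (l, r, (c - 1) % cols), (l, r, (c + 1) % cols),       # column direction
                ]
    return links

# A 4x4x4 torus: circuit (0, 1, 1) wraps around to layer 3 in the layer direction.
assert (3, 1, 1) in torus3d_neighbors(4, 4, 4)[(0, 1, 1)]
```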
From the above examples, those skilled in the art will appreciate that the connection relationships of other multi-dimensional arrays of processing circuits can be formed on the basis of a two-dimensional array by adding new dimensions and increasing the number of processing circuits. In some application scenarios, the solution of the present disclosure may also configure logical connections for the processing circuits by using configuration instructions. In other words, although hard-wired connections may exist between processing circuits, the solution of the present disclosure may, through configuration instructions, selectively connect some processing circuits or selectively bypass some processing circuits so as to form one or more logical connections. In some embodiments, the aforementioned logical connections may further be adjusted according to the requirements of the actual operation (for example, a data type conversion). Further, for different computing scenarios, the solution of the present disclosure may configure the connections of the processing circuits, including, for example, configuring them into a matrix or into one or more closed computing loops.
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. As can be seen from the figures, FIGS. 6a to 6d show further exemplary connection relationships of the multi-dimensional arrays formed by the plurality of processing circuits shown in FIGS. 5a to 5d. In view of this, the technical details described in conjunction with FIGS. 5a to 5d also apply to the content shown in FIGS. 6a to 6d.
As shown in FIG. 6a, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array and three processing circuits connected to it in each of the four directions along its row and column. Accordingly, the interval numbers of the connections between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively. As shown in FIG. 6b, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array, three processing circuits in the two opposite directions along its row, and one processing circuit in each of the two opposite directions along its column. Accordingly, the interval numbers of the connections between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the interval numbers of the connections with the processing circuits in the same column are all 0.
As shown above in conjunction with FIG. 5d, the multi-dimensional array formed by a plurality of processing circuits may be a three-dimensional array composed of a plurality of layers. Each layer of the three-dimensional array may include a two-dimensional array of a plurality of the processing circuits arranged along its row and column directions. Further, a processing circuit located in the three-dimensional array may be connected, in at least one of its row direction, column direction, diagonal direction and layer direction, with the remaining one or more processing circuits in the same row, the same column, the same diagonal or a different layer in a predetermined three-dimensional interval pattern. Further, the predetermined three-dimensional interval pattern and the number of mutually spaced processing circuits in the connection may be related to the number of spaced layers. The connection manner of the three-dimensional array will be further described below in conjunction with FIGS. 6c and 6d.
FIG. 6c shows a three-dimensional array of multiple layers, rows and columns formed by connecting a plurality of processing circuits. Taking the processing circuit located at layer l, row r and column c (denoted (l, r, c)) as an example, it is located at the center of the array and is connected, respectively, to the processing circuits at the preceding column (l, r, c-1) and the following column (l, r, c+1) of the same layer and row, to the processing circuits at the preceding row (l, r-1, c) and the following row (l, r+1, c) of the same layer and column, and to the processing circuits at the preceding layer (l-1, r, c) and the following layer (l+1, r, c) of the same row and column. Further, the interval numbers of the connections between the processing circuit at (l, r, c) and the other processing circuits in the row, column and layer directions are all 0.
FIG. 6d shows a three-dimensional array in which the interval numbers of the connections between the processing circuits in the row, column and layer directions are all 1. Taking the processing circuit at the center of the array, (l, r, c), as an example, it is connected to the processing circuits at (l, r, c-2) and (l, r, c+2), each one column away in the same layer and row, and to the processing circuits at (l, r-2, c) and (l, r+2, c), each one row away in the same layer and column. Further, it is connected to the processing circuits at (l-2, r, c) and (l+2, r, c), each one layer away in the same row and column. Similarly, among the remaining processing circuits, those at (l, r, c-3) and (l, r, c-1), one column apart in the same layer and row, are connected to each other, and those at (l, r, c+1) and (l, r, c+3) are connected to each other. Next, the processing circuits at (l, r-3, c) and (l, r-1, c), one row apart in the same layer and column, are connected to each other, as are those at (l, r+1, c) and (l, r+3, c). In addition, the processing circuits at (l-3, r, c) and (l-1, r, c), one layer apart in the same row and column, are connected to each other, and those at (l+1, r, c) and (l+3, r, c) are connected to each other.
The connection relationships of the multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above. Different loop structures formed by a plurality of processing circuits will be further exemplarily described below in conjunction with FIGS. 7 and 8.
FIGS. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure. Depending on the application scenario, a plurality of processing circuits may not only be connected according to their physical connection relationships, but may also be configured, according to a received parsed instruction, to be connected in a logical relationship. The plurality of processing circuits may be configured to be connected using the logical connection relationship so as to form a closed loop.
As shown in FIG. 7a, four adjacent processing circuits are sequentially numbered 0, 1, 2 and 3. Starting from processing circuit 0, the four processing circuits are connected in sequence in the clockwise direction, and processing circuit 3 is connected to processing circuit 0, so that the four processing circuits are connected in series to form a closed loop ("looping" for short). In this loop, the interval number between processing circuits is 0 or 2; for example, the interval number between processing circuits 0 and 1 is 0, while the interval number between processing circuits 3 and 0 is 2. Further, the physical addresses of the four processing circuits in the illustrated loop may be 0-1-2-3, and their logical addresses are likewise 0-1-2-3. It should be noted that the connection order shown in FIG. 7a is merely exemplary and not limiting; those skilled in the art may also, according to actual computing needs, connect the four processing circuits in series in the counterclockwise direction to form a closed loop.
In some practical scenarios, when the data bit width supported by a single processing circuit cannot meet the bit width requirement of the operational data, a plurality of processing circuits may be combined into a processing circuit group to represent one data item. For example, suppose one processing circuit can process 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into one processing circuit group, so that four pieces of 8-bit data are concatenated to form one piece of 32-bit data. Further, a processing circuit group formed by the aforementioned four 8-bit processing circuits can serve as one of the processing circuits 104 shown in FIG. 7b, thereby supporting operations of higher bit widths.
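The lane splitting described above can be pictured with the following minimal Python sketch; the helper names and the least-significant-slice-first ordering are assumptions introduced for illustration only.

```python
def split_into_lanes(value: int, lanes: int = 4, lane_bits: int = 8):
    """Split one wide operand into per-circuit slices, least significant
    slice first, so that a group of narrow circuits can jointly hold it."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(lanes)]

def join_lanes(slices, lane_bits: int = 8):
    """Reassemble the slices held by a processing circuit group."""
    return sum(s << (i * lane_bits) for i, s in enumerate(slices))

x = 0x12345678
assert split_into_lanes(x) == [0x78, 0x56, 0x34, 0x12]
assert join_lanes(split_into_lanes(x)) == x
```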
As can be seen from FIG. 7b, the layout of the processing circuits shown there is similar to that of FIG. 7a, but the interval numbers of the connections between the processing circuits differ from those of FIG. 7a. FIG. 7b shows four processing circuits sequentially numbered 0, 1, 2 and 3 that, starting from processing circuit 0 in the clockwise direction, are connected in the order processing circuit 1, processing circuit 3 and processing circuit 2, with processing circuit 2 connected back to processing circuit 0, thereby forming a closed loop in series. As can be seen from this loop, the interval numbers of the processing circuits shown in FIG. 7b are 0 or 1; for example, the interval between processing circuits 0 and 1 is 0, while the interval between processing circuits 1 and 3 is 1. Further, the physical addresses of the four processing circuits in the illustrated closed loop may be 0-1-2-3, while their logical addresses are 0-1-3-2. Therefore, when data of a high bit width needs to be split for distribution to different processing circuits, the data order can be rearranged and allocated according to the logical addresses of the processing circuits.
The above splitting and rearranging operations may be performed by the pre-processing circuit described in conjunction with FIG. 2. In particular, the pre-processing circuit may rearrange the input data according to the physical and logical addresses of the plurality of processing circuits so as to satisfy the requirements of the data operation. Assume that four sequentially arranged processing circuits 0 to 3 are connected as shown in FIG. 7a. Since both the physical addresses and the logical addresses of the connection are 0-1-2-3, the pre-processing circuit may transfer the input data (for example, pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. However, when the aforementioned four processing circuits are connected as shown in FIG. 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2; in this case the pre-processing circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transfer to the corresponding processing circuits. Based on this rearrangement of the input data, the solution of the present disclosure can guarantee the correctness of the data operation order. Similarly, if the order of the four operation output results (for example, pixel data) obtained above is bb0-bb1-bb3-bb2, the post-processing circuit described in conjunction with FIG. 2 may be used to restore the order of the operation output results to bb0-bb1-bb2-bb3, so as to guarantee consistency of arrangement between the input data and the output result data.
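A minimal sketch of this reorder step follows, assuming the data items are modeled as a Python list and that logical_order[p] gives the logical address of the circuit at physical position p (for FIG. 7b this order is 0-1-3-2); the function names are hypothetical and not part of the disclosure.

```python
def scatter_by_logical_address(items, logical_order):
    """Place item k on the circuit that carries logical address k.

    `logical_order[p]` is the logical address of the circuit at physical
    position p, so physical slot p receives items[logical_order[p]].
    """
    return [items[logical_order[p]] for p in range(len(items))]

def gather_by_logical_address(slots, logical_order):
    """Inverse step used on the outputs: restore the original item order."""
    restored = [None] * len(slots)
    for p, logical in enumerate(logical_order):
        restored[logical] = slots[p]
    return restored

order = [0, 1, 3, 2]   # logical addresses of the FIG. 7b loop, by physical position
sent = scatter_by_logical_address(["aa0", "aa1", "aa2", "aa3"], order)
assert sent == ["aa0", "aa1", "aa3", "aa2"]
assert gather_by_logical_address(["bb0", "bb1", "bb3", "bb2"], order) == ["bb0", "bb1", "bb2", "bb3"]
```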
FIGS. 7c and 7d show larger numbers of processing circuits arranged and connected in different ways to form closed loops. As shown in FIG. 7c, the 16 processing circuits 104 numbered sequentially 0, 1, ..., 15 are, starting from processing circuit 0, connected and combined two at a time in sequence to form processing circuit groups. For example, as shown in the figure, processing circuit 0 is connected with processing circuit 1 to form one processing circuit group, and so on, until processing circuit 14 is connected with processing circuit 15 to form one processing circuit group, finally yielding eight processing circuit groups. Further, these eight processing circuit groups may also be connected in a manner similar to the aforementioned connection of individual processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of processing circuit groups.
As shown in FIG. 7d, the plurality of processing circuits 104 are connected in an irregular, or non-uniform, manner to form a closed loop. Specifically, FIG. 7d shows that the processing circuits may be connected with interval numbers of 0 or 3 to form a closed loop; for example, processing circuit 0 may be connected to processing circuit 1 (interval number 0) and to processing circuit 4 (interval number 3), respectively.
As can be seen from the above description in conjunction with FIGS. 7a, 7b, 7c and 7d, the processing circuits of the present disclosure may be spaced apart by different numbers of processing circuits and connected into a closed loop. When the total number of processing circuits changes, any number of intermediate intervals may also be selected for dynamic configuration, thereby connecting the circuits into a closed loop. A plurality of processing circuits may also be combined into processing circuit groups and connected into a closed loop of processing circuit groups. In addition, the connections between the plurality of processing circuits may be hard connections implemented in hardware, or soft connections configured by software.
FIGS. 8a, 8b and 8c are schematic diagrams illustrating further loop structures of processing circuits according to embodiments of the present disclosure. As shown in conjunction with FIG. 6, a plurality of processing circuits may form a closed loop, and each processing circuit in the closed loop may be configured with its own logical address. Further, the pre-processing circuit described in conjunction with FIG. 2 may be configured to split the operational data accordingly, based on the type of the operational data (for example, 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the resulting sub-data to the corresponding processing circuits in the loop for subsequent operations.
The upper diagram of FIG. 8a shows four processing circuits connected to form a closed loop, with the physical addresses (which may also be referred to as physical coordinates in the context of the present disclosure) of the four processing circuits, in order from right to left, represented as 0-1-2-3. The lower diagram of FIG. 8a shows that the logical addresses of the four processing circuits in the aforementioned loop, in order from right to left, are represented as 0-3-1-2. For example, the processing circuit with logical address "3" in the lower diagram of FIG. 8a has the physical address "1" shown in the upper diagram of FIG. 8a.
In some application scenarios, assume that the granularity of the operational data is the lower 128 bits of the input data, for example the original sequence "15, 14, ..., 2, 1, 0" in the figure (each number corresponding to 8 bits of data), and that the logical addresses of these 16 pieces of 8-bit data are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of FIG. 8a, the pre-processing circuit may encode or arrange the data with different logical addresses depending on the data type.
When the data bit width operated on by the processing circuits is 32 bits, the four groups of numbers with logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) may represent the 0th to 3rd pieces of 32-bit data, respectively. The pre-processing circuit may transfer the 0th piece of 32-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st piece of 32-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd piece of 32-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd piece of 32-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). Through this rearrangement, the data satisfy the subsequent operation requirements of the processing circuits. The mapping between the logical addresses and the physical addresses of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
When the data bit width operated on by the processing circuits is 16 bits, the eight groups of numbers with logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) may represent the 0th to 7th pieces of 16-bit data, respectively. The pre-processing circuit may transfer the 0th and 4th pieces of 16-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st and 5th pieces of 16-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd and 6th pieces of 16-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd and 7th pieces of 16-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping between the logical addresses and the physical addresses of the final data is therefore:
(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
When the data bit width operated on by the processing circuits is 8 bits, the 16 numbers with logical addresses 0 to 15 may represent the 0th to 15th pieces of 8-bit data, respectively. According to the connection shown in FIG. 8a, the pre-processing circuit may transfer the 0th, 4th, 8th and 12th pieces of 8-bit data to the processing circuit with logical address "0" (corresponding physical address "0"); the 1st, 5th, 9th and 13th pieces of 8-bit data to the processing circuit with logical address "1" (corresponding physical address "2"); the 2nd, 6th, 10th and 14th pieces of 8-bit data to the processing circuit with logical address "2" (corresponding physical address "3"); and the 3rd, 7th, 11th and 15th pieces of 8-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping between the logical addresses and the physical addresses of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
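The three mappings quoted above for FIG. 8a can be reproduced with the following illustrative Python sketch. It assumes the distribution rule described in the preceding paragraphs (element k is placed on the circuit with logical address k modulo the number of circuits) and models each minimum unit as one byte number; the helper name and parameters are hypothetical. Under the same assumptions, calling ring_layout with total_bytes=32 and the FIG. 8b order [0, 7, 1, 6, 2, 5, 3, 4] is consistent with the per-circuit contents quoted below for that figure (for example, bytes 6, 14, 22 and 30 land on the circuit with logical address 6).

```python
def ring_layout(element_bytes: int, logical_order, total_bytes: int = 16):
    """Reproduce the byte layouts of FIG. 8a.

    Byte i of the input has number i; element k (of width `element_bytes`)
    goes to the circuit with logical address k % len(logical_order), where
    `logical_order[p]` is the logical address of the circuit at physical
    position p.  The returned list is the byte sequence ordered by physical
    position, highest byte first, matching the mappings quoted in the text.
    """
    circuits = len(logical_order)
    per_circuit = {addr: [] for addr in logical_order}
    for k in range(total_bytes // element_bytes):
        element = list(range(k * element_bytes, (k + 1) * element_bytes))
        per_circuit[k % circuits].extend(element)
    layout = []
    for p in range(circuits):
        layout.extend(per_circuit[logical_order[p]])
    return layout[::-1]   # list from the highest byte down, as written in the text

order = [0, 3, 1, 2]   # logical addresses of the FIG. 8a loop, by physical position
assert ring_layout(4, order) == [11, 10, 9, 8, 7, 6, 5, 4, 15, 14, 13, 12, 3, 2, 1, 0]
assert ring_layout(2, order) == [13, 12, 5, 4, 11, 10, 3, 2, 15, 14, 7, 6, 9, 8, 1, 0]
assert ring_layout(1, order) == [14, 10, 6, 2, 13, 9, 5, 1, 15, 11, 7, 3, 12, 8, 4, 0]
```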
The upper diagram of FIG. 8b shows eight sequentially numbered processing circuits 0 to 7 connected to form a closed loop, with the physical addresses of the eight processing circuits being 0-1-2-3-4-5-6-7. The lower diagram of FIG. 8b shows that the logical addresses of the aforementioned eight processing circuits are 0-7-1-6-2-5-3-4. For example, the processing circuit with physical address "6" in the upper diagram of FIG. 8b corresponds to the logical address "3" shown in the lower diagram of FIG. 8b.
For the different data types shown in FIG. 8b, the operation by which the pre-processing circuit rearranges the data and transfers it to the corresponding processing circuits is similar to that of FIG. 8a, so the technical solution described in conjunction with FIG. 8a also applies to FIG. 8b and the above data rearrangement process is not repeated here. Further, the connection relationship of the processing circuits shown in FIG. 8b is similar to that shown in FIG. 8a, but FIG. 8b shows eight processing circuits, twice the number shown in FIG. 8a. Thus, in application scenarios operating on different data types, the granularity of the operational data described in conjunction with FIG. 8b may be twice that described in conjunction with FIG. 8a. Therefore, whereas the granularity of the input data in the previous example was its lower 128 bits, the granularity of the operational data in this example may be the lower 256 bits of the input data, for example the original data sequence "31, 30, ..., 2, 1, 0" shown in the figure, where each number corresponds to a length of 8 bits.
For the above original data sequence, the figure also shows the arrangement of the data in the looped processing circuits when the data bit widths operated on by the processing circuits are 32 bits, 16 bits and 8 bits, respectively. For example, when the operated data bit width is 32 bits, the single piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the operated data bit width is 16 bits, the two pieces of 16-bit data in the processing circuit with logical address "3" are (23,22,7,6), and the corresponding physical address of that processing circuit is "6". When the operated data bit width is 8 bits, the four pieces of 8-bit data in the processing circuit with logical address "6" are (30,22,14,6), and the corresponding physical address of that processing circuit is "3".
The description above in conjunction with FIGS. 8a and 8b addressed data operations of different data types for the case where a plurality of processing circuits of a single type (such as the first-type processing circuits shown in FIG. 3) are connected to form a closed loop. Data operations of different data types are further described below in conjunction with FIG. 8c, for the case where a plurality of processing circuits of different types (such as the first-type and second-type processing circuits shown in FIG. 4) are connected to form a closed loop.
The upper diagram of FIG. 8c shows twenty processing circuits of multiple types, numbered sequentially 0, 1, ..., 19, connected to form a closed loop (the numbers shown in the figure are the physical addresses of the processing circuits). The sixteen processing circuits numbered 0 to 15 are first-type processing circuits, and the four processing circuits numbered 16 to 19 are second-type processing circuits. Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower diagram of FIG. 8c.
Further, when operating on different data types, for example on the original sequence of eighty 8-bit values shown in the figure, FIG. 8c also shows the result of operating on that original data for the different data types supported by the processing circuits. For example, when the operated data bit width is 32 bits, the single piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the operated data bit width is 16 bits, the two pieces of 16-bit data in the processing circuit with logical address "11" are (63,62,23,22), and the corresponding physical address of that processing circuit is "9". When the operated data bit width is 8 bits, the four pieces of 8-bit data in the processing circuit with logical address "17" are (77,57,37,17), and the corresponding physical address of that processing circuit is "18".
FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by the pre-processing circuit according to embodiments of the present disclosure. As mentioned above, the pre-processing circuit described in conjunction with FIG. 2 may also be configured to select one data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data. Regarding the plurality of data splicing modes, in one embodiment the solution of the present disclosure divides and numbers the two pieces of data to be spliced by minimum data unit, and then extracts different minimum data units of the data based on specified rules so as to form different data splicing modes. For example, the extraction and placement may be performed, for example alternately, based on the parity of the numbers or on whether a number is an integer multiple of a specified value, thereby forming different data splicing patterns. Depending on the computing scenario (for example, the data bit width), the minimum data unit here may simply be 1 bit of data, or may be 2, 4, 8, 16 or 32 bits in length. Further, when extracting the differently numbered parts of the two pieces of data, the solution of the present disclosure may extract alternately by single minimum data unit, or by multiples of the minimum data unit, for example alternately extracting from the two pieces of data partial data of two or three minimum data units at a time as a group and splicing group by group.
Based on the above description of the data splicing modes, the data splicing modes of the present disclosure are exemplarily explained below with specific examples in conjunction with FIGS. 9a to 9c. In the figures shown, the input data are In1 and In2, and when each square in the figure represents one minimum data unit, both pieces of input data have a bit width of eight minimum data units. As mentioned above, for data of different bit widths the minimum data unit may represent a different number of bits. For example, for data with a bit width of 8 bits the minimum data unit represents 1 bit of data, while for data with a bit width of 16 bits the minimum data unit represents 2 bits of data. As another example, for data with a bit width of 32 bits the minimum data unit represents 4 bits of data.
As shown in FIG. 9a, the two pieces of input data In1 and In2 to be spliced each consist of eight minimum data units numbered 1, 2, ..., 8 from right to left. The data are spliced according to an odd-even interleaving principle: from the smallest number to the largest, In1 before In2, and odd numbers before even numbers. Specifically, when the operated data bit width is 8 bits, In1 and In2 each represent one piece of 8-bit data, and each minimum data unit represents 1 bit of data (that is, one square represents 1 bit). According to the bit width of the data and the aforementioned splicing principle, the minimum data units of In1 numbered 1, 3, 5 and 7 are first extracted and arranged in sequence at the low positions. Next, the four odd-numbered minimum data units of In2 are arranged in sequence. Similarly, the minimum data units of In1 numbered 2, 4, 6 and 8 and the four even-numbered minimum data units of In2 are then arranged in sequence. Finally, the 16 minimum data units are spliced into one piece of new 16-bit data or two pieces of new 8-bit data, as shown by the second row of squares in FIG. 9a.
As shown in FIG. 9b, when the data bit width is 16 bits, In1 and In2 each represent one piece of 16-bit data, and each minimum data unit now represents 2 bits of data (that is, one square represents 2 bits). According to the bit width of the data and the aforementioned interleaved splicing principle, the minimum data units of In1 numbered 1, 2, 5 and 6 may first be extracted and arranged in sequence at the low positions. Then the minimum data units of In2 numbered 1, 2, 5 and 6 are arranged in sequence. Similarly, the minimum data units of In1 numbered 3, 4, 7 and 8 and the identically numbered minimum data units of In2 are arranged in sequence, so that the final 16 minimum data units are spliced into one piece of new 32-bit data or two pieces of new 16-bit data, as shown by the second row of squares in FIG. 9b.
As shown in FIG. 9c, when the data bit width is 32 bits, In1 and In2 each represent one piece of 32-bit data, and each minimum data unit represents 4 bits of data (that is, one square represents 4 bits). According to the bit width of the data and the aforementioned interleaved splicing principle, the minimum data units of In1 numbered 1, 2, 3 and 4 and then the identically numbered minimum data units of In2 may first be extracted and arranged in sequence at the low positions. Then the minimum data units of In1 numbered 5, 6, 7 and 8 and the identically numbered minimum data units of In2 are extracted and arranged in sequence, so that the final 16 minimum data units are spliced into one piece of new 64-bit data or two pieces of new 32-bit data.
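The three interleaving patterns of FIGS. 9a to 9c can be summarized by the following illustrative Python sketch, in which each minimum data unit is modeled as a list element and the group size 1, 2 or 4 corresponds to the 8-bit, 16-bit and 32-bit cases respectively; the helper name is an assumption for this example.

```python
def interleave_splice(in1, in2, group: int):
    """Splice two 8-unit operands in the interleaved patterns of FIGS. 9a-9c.

    The units of each operand (index 0 = unit number 1, the low position)
    are cut into blocks of `group` units; the even-indexed blocks of In1,
    then of In2, are placed at the low positions, followed by the
    odd-indexed blocks of In1 and of In2.
    """
    blocks1 = [in1[i:i + group] for i in range(0, len(in1), group)]
    blocks2 = [in2[i:i + group] for i in range(0, len(in2), group)]
    out = []
    for blocks in (blocks1, blocks2):
        for b in blocks[0::2]:        # even-indexed blocks first
            out.extend(b)
    for blocks in (blocks1, blocks2):
        for b in blocks[1::2]:        # then the odd-indexed blocks
            out.extend(b)
    return out

a = [f"a{i}" for i in range(1, 9)]    # units of In1, numbered 1..8 from the low position
b = [f"b{i}" for i in range(1, 9)]    # units of In2
# FIG. 9a pattern (group = 1): a1 a3 a5 a7 b1 b3 b5 b7 a2 a4 a6 a8 b2 b4 b6 b8
assert interleave_splice(a, b, 1)[:8] == ["a1", "a3", "a5", "a7", "b1", "b3", "b5", "b7"]
```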
The exemplary data splicing manners of the present disclosure have been described above in conjunction with FIGS. 9a to 9c. It can be understood, however, that in some computing scenarios the data splicing does not involve the above interleaved arrangement, but is merely a simple arrangement of the two pieces of data with their original data positions kept unchanged, as shown for example in FIG. 9d. As can be seen from FIG. 9d, the two pieces of data In1 and In2 are not interleaved as in FIGS. 9a to 9c; instead, the last minimum data unit of In1 and the first minimum data unit of In2 are simply placed end to end, thereby obtaining a new data type with an increased (for example, doubled) bit width. In some scenarios, the solution of the present disclosure may also perform grouped splicing based on data attributes. For example, neuron data or weight data belonging to the same feature map may be formed into a group and then arranged so as to constitute a contiguous part of the spliced data.
FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by the post-processing circuit according to embodiments of the present disclosure. The compression operation may include filtering the data with a mask, or compressing by comparing the data against a given threshold. For the data compression operation, the data may be divided and numbered by minimum data unit as described above. Similar to what was described in conjunction with FIGS. 9a to 9d, the minimum data unit may be, for example, 1 bit of data, or 2, 4, 8, 16 or 32 bits in length. Exemplary descriptions of the different data compression modes are given below in conjunction with FIGS. 10a to 10c.
As shown in FIG. 10a, the original data consists of eight squares (that is, eight minimum data units) arranged in sequence and numbered 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit represents 1 bit of data. When the data compression operation is performed according to a mask, the post-processing circuit may filter the original data using the mask to perform the compression. In one embodiment, the bit width of the mask corresponds to the number of minimum data units of the original data. For example, since the aforementioned original data has 8 minimum data units, the mask is 8 bits wide, with the minimum data unit numbered 1 corresponding to the lowest bit of the mask, the minimum data unit numbered 2 corresponding to the next lowest bit, and so on, up to the minimum data unit numbered 8 corresponding to the highest bit of the mask. In one application scenario, when the 8-bit mask is "10010011", the compression principle may be set to extract the minimum data units of the original data that correspond to the mask bits equal to "1". For example, the numbers of the minimum data units corresponding to mask value "1" are 1, 2, 5 and 8. Accordingly, the minimum data units numbered 1, 2, 5 and 8 can be extracted and arranged in order from the lowest number to the highest to form the new compressed data, as shown in the second row of FIG. 10a.
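A minimal sketch of the mask-based compression just described follows, assuming the units are modeled as list elements and the mask is written as a bit string with its highest bit first; the helper name is hypothetical.

```python
def mask_compress(units, mask: str):
    """Keep only the minimum data units whose mask bit is '1'.

    `units[0]` is unit number 1 and corresponds to the lowest mask bit,
    so the mask string (MSB first, e.g. "10010011") is scanned from its
    right end.  Kept units stay in ascending number order.
    """
    kept_numbers = [i + 1 for i, bit in enumerate(reversed(mask)) if bit == "1"]
    return [units[n - 1] for n in kept_numbers]

units = [f"u{i}" for i in range(1, 9)]          # units numbered 1..8
assert mask_compress(units, "10010011") == ["u1", "u2", "u5", "u8"]
```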
FIG. 10b shows original data similar to that of FIG. 10a, and as can be seen from the second row of FIG. 10b, the data sequence passing through the post-processing circuit retains its original arrangement order and content. It can thus be understood that the data compression of the present disclosure may also include a disabled mode, or non-compression mode, in which no compression operation is performed when the data passes through the post-processing circuit.
As shown in FIG. 10c, the original data consists of eight squares arranged in sequence; the number above each square indicates its number, running 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit may be 8 bits of data. Further, the number inside each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", corresponding to the 8-bit data "00001000". When the data compression operation is performed according to a threshold, assuming the threshold is the decimal value "8", the compression principle may be set to extract all minimum data units of the original data that are greater than or equal to the threshold "8". Accordingly, the minimum data units numbered 1, 4, 7 and 8 can be extracted. All the extracted minimum data units are then arranged in order from the lowest number to the highest to obtain the final data result, as shown in the second row of FIG. 10c.
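A minimal sketch of the threshold-based compression, under the assumption that the unit values are held in a Python list in unit-number order; apart from the value 8 of unit 1 taken from the text, the example values are placeholders chosen so that units 1, 4, 7 and 8 pass the threshold, as in FIG. 10c.

```python
def threshold_compress(values, threshold: int):
    """Keep the minimum data units whose value is >= threshold,
    preserving their ascending number order (values[0] is unit number 1)."""
    return [v for v in values if v >= threshold]

# Hypothetical contents for the eight units of FIG. 10c; only unit 1's
# value of 8 is taken from the text, the rest are placeholders.
values = [8, 3, 1, 12, 5, 7, 9, 20]
assert threshold_compress(values, 8) == [8, 12, 9, 20]   # units 1, 4, 7 and 8
```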
FIG. 11 is a simplified flowchart illustrating a method 1100 of performing operations using a computing device according to an embodiment of the present disclosure, where the computing device may have the hardware architecture described in conjunction with FIGS. 1 to 4.
As shown in FIG. 11, at step 1110 the method 1100 may use the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits. In one embodiment, the control circuit may determine, according to instruction identification information in the instruction, the one or more processing circuits that are to perform the operation, and send the parsed instruction to one or more of the plurality of processing circuits so as to perform the corresponding operation specified by the parsed instruction.
In one or more embodiments, in the process of parsing the instruction, the control circuit may perform a decoding operation on the instruction and, according to the decoding result, send the parsed instruction to one or more of the plurality of processing circuits. When a plurality of processing circuits all support the same, non-specific type of operation, the control circuit may, according to the operating states of the plurality of processing circuits, send the parsed instruction to processing circuits whose utilization is low or which are idle. Further, depending on the application scenario, the parsed instruction may also be a parsed instruction that has not been decoded by the control circuit, in which case the one or more processing circuits may include corresponding decoding circuits to decode the received parsed instruction, for example to generate a plurality of micro-instructions, so that the one or more processing circuits can perform subsequent operations according to the micro-instructions.
Next, the flow may proceed to step 1120, where the method 1100 may use the one or more processing circuits to perform multi-threaded operations according to the parsed instruction. In one embodiment, the plurality of processing circuits may be configured to receive and execute the parsed instruction in a single instruction, multiple threads ("SIMT") manner. In another embodiment, the plurality of processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the arrays of processing circuits connected in series through such connections may form one or more closed loops. In yet another embodiment, the plurality of processing circuits may determine, according to information in the received parsed instruction (for example, predicate information), whether to perform the operation specified by the parsed instruction.
FIG. 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of the present disclosure. As shown in FIG. 12, the combined processing apparatus 1200 includes a computing processing apparatus 1202, an interface apparatus 1204, other processing apparatuses 1206 and a storage apparatus 1208. Depending on the application scenario, the computing processing apparatus may include one or more computing apparatuses 1210, which may be configured to perform the operations described herein in conjunction with FIGS. 1 to 11.
In different embodiments, the computing processing apparatus of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing apparatuses included in the computing processing apparatus may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When a plurality of computing apparatuses are implemented as artificial intelligence processor cores or parts of the hardware structure of an artificial intelligence processor core, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete a user-specified operation. Depending on the implementation, the other processing apparatuses of the present disclosure may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU) and an artificial intelligence processor. These processors may include, but are not limited to, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing apparatus and the other processing apparatuses are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an operation device related to artificial intelligence operations such as neural network operations) and external data and control, performing basic control including, but not limited to, data movement and starting and/or stopping the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computing tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit it to the other processing devices.
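The three transfer paths just described can be illustrated informally as follows. This is only a behavioral sketch under stated assumptions; the class InterfaceDevice and its method names are made up for this example and do not describe an actual hardware interface.

```python
# Illustrative sketch of the interface device's three transfer paths:
# other device -> on-chip data storage, other device -> on-chip control cache,
# and on-chip data storage -> other device. All names are assumptions.
class InterfaceDevice:
    def __init__(self, on_chip_data: dict, control_cache: list):
        self.on_chip_data = on_chip_data      # on-chip storage of the computing processing device
        self.control_cache = control_cache    # on-chip control cache

    def write_input_data(self, name: str, data: list) -> None:
        # Another processing device pushes operands into on-chip storage.
        self.on_chip_data[name] = list(data)

    def write_control(self, instruction: str) -> None:
        # Control instructions are queued in the on-chip control cache.
        self.control_cache.append(instruction)

    def read_result(self, name: str) -> list:
        # Results flow back from on-chip storage to the other processing device.
        return list(self.on_chip_data.get(name, []))

iface = InterfaceDevice(on_chip_data={}, control_cache=[])
iface.write_input_data("x", [1, 2, 3])
iface.write_control("CONV x -> y")
print(iface.read_result("x"))   # [1, 2, 3]
```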
Additionally or optionally, the combined processing apparatus of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing devices, for example, data that cannot be fully held in the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 1302 shown in FIG. 13). In one implementation, the chip is a system on chip (SoC) that integrates one or more combined processing apparatuses as shown in FIG. 12. The chip may be connected to other related components through an external interface device (such as the external interface device 1306 shown in FIG. 13). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., a video codec) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure that includes the above-mentioned chip. In some embodiments, the present disclosure further discloses a board that includes the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 13.
FIG. 13 is a schematic structural diagram of a board 1300 according to an embodiment of the present disclosure. As shown in FIG. 13, the board includes a storage device 1304 for storing data, which includes one or more storage units 1310. The storage device may be connected to, and exchange data with, the control device 1308 and the above-described chip 1302 through, for example, a bus. Further, the board also includes an external interface device 1306 configured for data relay or transfer between the chip (or a chip in a chip package structure) and an external device 1312 (e.g., a server or a computer). For example, data to be processed may be transferred from the external device to the chip through the external interface device. For another example, a computation result of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different interface forms; for example, it may adopt a standard PCIe interface.
In one or more embodiments, the control device in the board of the present disclosure may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.
From the above description in conjunction with FIG. 12 and FIG. 13, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing apparatuses.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a means of transport, a household appliance, and/or a medical device. The means of transport includes an airplane, a ship, and/or a vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound scanner, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, an electronic device or apparatus with high computing power according to the solutions of the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for implementing one or more of the solutions of the present disclosure. In addition, depending on the solution, the descriptions of different embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teachings of the present disclosure, those skilled in the art will understand that the several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the foregoing electronic device or apparatus embodiments are divided herein on the basis of logical functions, and there may be other division manners in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As far as the connection relationships between different units or components are concerned, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist separately.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solutions of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., the computing device or other processing devices) may be implemented by appropriate hardware processors, such as a CPU, GPU, FPGA, DSP, or ASIC. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. A computing device, comprising a control circuit and a plurality of processing circuits, wherein:
the control circuit is configured to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction.
Clause 2. The computing device according to clause 1, wherein, in parsing the instruction, the control circuit is configured to:
acquire instruction identification information in the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
Clause 3. The computing device according to clause 1, wherein, in parsing the instruction, the control circuit is configured to:
decode the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
Clause 4. The computing device according to clause 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits so as to perform different types of data processing.
Clause 5. The computing device according to clause 1, wherein the plurality of processing circuits are divided into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
Clause 6. The computing device according to clause 1, wherein the multi-dimensional array is a two-dimensional array, and a processing circuit located in the two-dimensional array is connected, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern.
Clause 7. The computing device according to clause 6, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
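As an informal illustration of the two-dimensional spacing pattern referred to in clauses 6 and 7, consider a single row of N processing circuits where each circuit is connected to the circuit a fixed interval away, wrapping around the row. This is only a behavioral sketch; the indexing scheme and the function name are assumptions introduced for this example.

```python
# Informal sketch of an interval-based connection along one row: each of N
# processing circuits is linked to the circuit `interval` positions away
# (wrapping around), which partitions the row into gcd(N, interval) closed
# loops. Names are assumptions for this example.
from math import gcd

def interval_loops(num_circuits: int, interval: int) -> list:
    visited = set()
    loops = []
    for start in range(num_circuits):
        if start in visited:
            continue
        loop, node = [], start
        while node not in visited:
            visited.add(node)
            loop.append(node)
            node = (node + interval) % num_circuits   # skips `interval - 1` circuits
        loops.append(loop)
    return loops

print(interval_loops(8, 1))   # one loop: [[0, 1, 2, 3, 4, 5, 6, 7]]
print(interval_loops(8, 2))   # two loops of four circuits each
assert len(interval_loops(8, 2)) == gcd(8, 2)
```

The same idea extends to the column, diagonal, and layer directions mentioned in the following clauses, with the interval (and, for three dimensions, the number of skipped layers) selecting which circuits become neighbors.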
Clause 8. The computing device according to clause 1, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, wherein:
a processing circuit located in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
Clause 9. The computing device according to clause 8, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
Clause 10. The computing device according to any one of clauses 6-9, wherein the plurality of processing circuits are configured to be connected by logical connections so as to form one or more closed loops.
Clause 11. The computing device according to clause 10, wherein the plurality of processing circuits are configured to determine, according to the parsed instruction, whether to be connected by logical connections so as to form one or more closed loops.
Clause 12. The computing device according to clause 1, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
Clause 13. The computing device according to clause 12, wherein, when multiple processing circuit groups are formed to process data, the multiple processing circuit groups are connected by logical connections according to the parsed instruction so as to form one or more closed loops.
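The grouping by bit width described in clauses 12 and 13 can be sketched informally as follows. The lane width, the helper names, and the ring-style linking of groups are assumptions made for this example, not the claimed implementation.

```python
# Informal sketch: form processing circuit groups sized to the data bit width,
# then logically chain the groups into one closed loop. Assumes each circuit
# handles LANE_WIDTH bits; all names and the lane width are illustrative.
LANE_WIDTH = 8   # bits handled by one processing circuit (assumed)

def form_groups(circuit_ids: list, data_width: int) -> list:
    per_group = -(-data_width // LANE_WIDTH)          # ceiling division
    return [circuit_ids[i:i + per_group]
            for i in range(0, len(circuit_ids) - per_group + 1, per_group)]

def loop_links(groups: list) -> list:
    # Logically connect group i to group i+1, and the last group back to the
    # first, forming one closed loop of groups.
    return [(i, (i + 1) % len(groups)) for i in range(len(groups))]

groups = form_groups(list(range(16)), data_width=32)   # 4 circuits per group
print(groups)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(loop_links(groups))   # [(0, 1), (1, 2), (2, 3), (3, 0)]
```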
Clause 14. The computing device according to clause 1, wherein each of the processing circuits comprises:
a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and
a storage circuit comprising a data storage circuit, wherein the data storage circuit is configured to store at least one of operation data and an intermediate operation result of the processing circuit.
Clause 15. The computing device according to clause 14, wherein the storage circuit further comprises a predicate storage circuit, wherein the predicate storage circuit is configured to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
Clause 16. The computing device according to clause 15, wherein the predicate storage circuit is further configured to:
update the predicate information according to the parsed instruction; or
update the predicate information according to an operation result of each of the processing circuits.
Clause 17. The computing device according to clause 15, wherein each of the processing circuits is configured to:
acquire the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
Clause 18. The computing device according to clause 1, wherein the processing circuit further comprises an arithmetic operation circuit configured to perform an arithmetic operation.
Clause 19. The computing device according to clause 8, further comprising:
a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on an operation result after the processing circuit performs the operation.
Clause 20. The computing device according to clause 19, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to split the operation data according to the type and logical address of the operation data and to transfer the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation.
Clause 21. The computing device according to clause 19, wherein the pre-processing circuit is further configured to select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
Clause 22. The computing device according to clause 21, wherein the post-processing circuit is further configured to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
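The two compression (filtering) modes named in clause 22 can be sketched informally as follows; the function names are assumptions for this example, and the sketch is not the disclosed post-processing circuit.

```python
# Informal sketch of the two compression modes: keeping elements selected by a
# mask, or keeping elements that pass a threshold comparison.
def compress_by_mask(data: list, mask: list) -> list:
    # Keep only elements whose mask bit is set.
    return [x for x, keep in zip(data, mask) if keep]

def compress_by_threshold(data: list, threshold: float) -> list:
    # Keep only elements larger than the given threshold.
    return [x for x in data if x > threshold]

values = [3, -1, 7, 0, 5]
print(compress_by_mask(values, [1, 0, 1, 0, 1]))   # [3, 7, 5]
print(compress_by_threshold(values, 2))            # [3, 7, 5]
```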
Clause 23. The computing device according to clause 1, further comprising:
a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the main storage module is configured to store data to be used by the processing circuits in performing operations and operation results obtained after the operations are performed, and the main cache module is configured to cache intermediate operation results obtained after the processing circuits perform operations.
Clause 24. The computing device according to any one of clauses 1-9 or 11-23, wherein the plurality of processing circuits are configured to receive and execute the parsed instruction in a SIMT manner.
Clause 25. An integrated circuit chip, comprising the computing device according to any one of clauses 1-24.
Clause 26. A board, comprising the integrated circuit chip according to clause 25.
Clause 27. A method of performing operations by using a computing device, wherein the computing device comprises a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising:
using the control circuit to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
Clause 28. The method according to clause 27, wherein, in parsing the instruction, the method uses the control circuit to:
acquire instruction identification information in the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
Clause 29. The method according to clause 27, wherein, in parsing the instruction, the method uses the control circuit to:
decode the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
Clause 30. The method according to clause 27, comprising dividing the plurality of processing circuits into multiple types of processing circuits so as to perform different types of data processing.
Clause 31. The method according to clause 27, wherein dividing the plurality of processing circuits into multiple types of processing circuits comprises dividing the plurality of processing circuits into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
Clause 32. The method according to clause 27, wherein the multi-dimensional array is a two-dimensional array, and the method comprises connecting a processing circuit located in the two-dimensional array, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern.
Clause 33. The method according to clause 32, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
Clause 34. The method according to clause 27, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, the method comprising:
connecting a processing circuit located in the three-dimensional array, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
Clause 35. The method according to clause 34, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
Clause 36. The method according to any one of clauses 32-35, comprising connecting the plurality of processing circuits by logical connections so as to form one or more closed loops.
Clause 37. The method according to clause 36, wherein the method comprises determining, according to the parsed instruction, whether to connect the plurality of processing circuits by logical connections so as to form one or more closed loops.
Clause 38. The method according to clause 27, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
Clause 39. The method according to clause 38, wherein, when multiple processing circuit groups are formed to process data, the method comprises connecting the multiple processing circuit groups by logical connections according to the parsed instruction so as to form one or more closed loops.
Clause 40. The method according to clause 27, wherein each of the processing circuits comprises a logic operation circuit and a storage circuit, and the storage circuit comprises a data storage circuit, wherein the method comprises, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, and using the data storage circuit to store at least one of operation data and an intermediate operation result of the processing circuit.
Clause 41. The method according to clause 40, wherein the storage circuit further comprises a predicate storage circuit, and the method comprises using the predicate storage circuit to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
Clause 42. The method according to clause 41, further comprising using the predicate storage circuit to perform the following steps:
updating the predicate information according to the parsed instruction; or
updating the predicate information according to an operation result of each of the processing circuits.
Clause 43. The method according to clause 41, further comprising using each of the processing circuits to perform the following steps:
acquiring the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
determining, according to the predicate information, whether the processing circuit executes the parsed instruction.
Clause 44. The method according to clause 27, wherein the processing circuit further comprises an arithmetic operation circuit, and the method comprises using the arithmetic operation circuit to perform an arithmetic operation.
Clause 45. The method according to clause 34, wherein the computing device further comprises a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method comprises, before the processing circuit performs an operation, using the pre-processing circuit to perform a pre-processing operation on operation data, and, after the processing circuit performs the operation, using the post-processing circuit to perform a post-processing operation on an operation result.
Clause 46. The method according to clause 45, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the method comprises using the pre-processing circuit to split the operation data according to the type and logical address of the operation data and to transfer the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation.
Clause 47. The method according to clause 45, wherein the method further comprises using the pre-processing circuit to select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
Clause 48. The method according to clause 47, wherein the method further comprises using the post-processing circuit to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
Clause 49. The method according to clause 27, wherein the computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the method comprises using the main storage module to store data to be used by the processing circuits in performing operations and operation results obtained after the operations are performed, and using the main cache module to cache intermediate operation results obtained after the processing circuits perform operations.
Clause 50. The method according to any one of clauses 27-49, wherein the method comprises using the plurality of processing circuits to receive and execute the parsed instruction in a SIMT manner.
Although multiple embodiments of the present disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Those skilled in the art may conceive of numerous modifications, changes, and substitutions without departing from the ideas and spirit of the present disclosure. It should be understood that various alternatives to the embodiments of the present disclosure described herein may be employed in practicing the present disclosure. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents and alternatives within the scope of these claims.

Claims (34)

  1. A computing device, comprising a control circuit and a plurality of processing circuits, wherein:
    the control circuit is configured to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
    the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction.
  2. The computing device according to claim 1, wherein, in parsing the instruction, the control circuit is configured to:
    acquire instruction identification information in the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information;
    or
    in parsing the instruction, the control circuit is configured to:
    decode the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
  3. The computing device according to claim 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits so as to perform different types of data processing.
  4. The computing device according to claim 1, wherein the plurality of processing circuits are divided into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
  5. The computing device according to claim 1, wherein the multi-dimensional array is a two-dimensional array, and a processing circuit located in the two-dimensional array is connected, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  6. The computing device according to claim 1, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, wherein:
    a processing circuit located in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
  7. The computing device according to claim 5 or 6, wherein the plurality of processing circuits are configured to determine, according to the parsed instruction, whether to be connected by logical connections.
  8. The computing device according to claim 1, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
  9. The computing device according to claim 8, wherein, when multiple processing circuit groups are formed to process data, the multiple processing circuit groups are connected by logical connections according to the parsed instruction so as to form one or more closed loops.
  10. The computing device according to claim 1, wherein each of the processing circuits comprises:
    a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and
    a storage circuit comprising a data storage circuit and a predicate storage circuit, wherein the data storage circuit is configured to store at least one of operation data and an intermediate operation result of the processing circuit, and the predicate storage circuit is configured to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
  11. The computing device according to claim 10, wherein the predicate storage circuit is further configured to:
    update the predicate information according to the parsed instruction; or
    update the predicate information according to an operation result of each of the processing circuits.
  12. The computing device according to claim 10, wherein each of the processing circuits is configured to:
    acquire the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
    determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
  13. The computing device according to claim 6, further comprising:
    a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on an operation result after the processing circuit performs the operation.
  14. The computing device according to claim 13, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to perform at least one of the following:
    splitting the operation data according to the type and logical address of the operation data and transferring the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation; and
    selecting one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
  15. The computing device according to claim 14, wherein the post-processing circuit is further configured to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
  16. The computing device according to any one of claims 1-15, wherein the plurality of processing circuits are configured to receive and execute the parsed instruction in a SIMT manner.
  17. An integrated circuit chip, comprising the computing device according to any one of claims 1-16.
  18. A board, comprising the integrated circuit chip according to claim 17.
  19. A method of performing computing operations by using a computing device, wherein the computing device comprises a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising:
    using the control circuit to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
    using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  20. The method according to claim 19, wherein, in parsing the instruction, the method uses the control circuit to:
    acquire instruction identification information in the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information;
    or
    in parsing the instruction, uses the control circuit to:
    decode the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
  21. The method according to claim 19, comprising dividing the plurality of processing circuits into multiple types of processing circuits so as to perform different types of data processing.
  22. The method according to claim 19, wherein dividing the plurality of processing circuits into multiple types of processing circuits comprises dividing the plurality of processing circuits into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
  23. The method according to claim 19, wherein the multi-dimensional array is a two-dimensional array, and the method comprises connecting a processing circuit located in the two-dimensional array, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  24. The method according to claim 19, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, the method comprising:
    connecting a processing circuit located in the three-dimensional array, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
  25. The method according to claim 23 or 24, wherein whether to connect the plurality of processing circuits by logical connections is determined according to the parsed instruction.
  26. The method according to claim 19, wherein a plurality of the processing circuits are formed into at least one processing circuit group according to the bit width of the received data, so as to process the data.
  27. The method according to claim 26, wherein, when multiple processing circuit groups are formed to process data, the method comprises connecting the multiple processing circuit groups by logical connections according to the parsed instruction so as to form one or more closed loops.
  28. The method according to claim 19, wherein each of the processing circuits comprises a logic operation circuit and a storage circuit, and the storage circuit comprises a data storage circuit and a predicate storage circuit, the method comprising, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, using the data storage circuit to store at least one of operation data and an intermediate operation result of the processing circuit, and using the predicate storage circuit to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
  29. The method according to claim 28, further comprising using the predicate storage circuit to perform the following steps:
    updating the predicate information according to the parsed instruction; or
    updating the predicate information according to an operation result of each of the processing circuits.
  30. The method according to claim 28, further comprising using each of the processing circuits to perform the following steps:
    acquiring the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
    determining, according to the predicate information, whether the processing circuit executes the parsed instruction.
  31. The method according to claim 24, wherein the computing device further comprises a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method further comprises, before the processing circuit performs an operation, using the pre-processing circuit to perform a pre-processing operation on operation data, and, after the processing circuit performs the operation, using the post-processing circuit to perform a post-processing operation on an operation result.
  32. The method according to claim 31, comprising configuring each of the plurality of processing circuits in the closed loop with a respective logical address, and using the pre-processing circuit to perform at least one of the following:
    splitting the operation data according to the type and logical address of the operation data and transferring the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation; and
    selecting one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
  33. The method according to claim 32, further comprising using the post-processing circuit to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
  34. The method according to any one of claims 19-33, comprising using the plurality of processing circuits to receive and execute the parsed instruction in a SIMT manner.
PCT/CN2021/094468 2020-06-30 2021-05-18 Computing apparatus, integrated circuit chip, board and computing method WO2022001439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010618120.3 2020-06-30
CN202010618120.3A CN113867790A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card and computing method

Publications (1)

Publication Number Publication Date
WO2022001439A1 true WO2022001439A1 (en) 2022-01-06

Family

ID=78981876

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094468 WO2022001439A1 (en) 2020-06-30 2021-05-18 Computing apparatus, integrated circuit chip, board and computing method

Country Status (2)

Country Link
CN (1) CN113867790A (en)
WO (1) WO2022001439A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1261966A (en) * 1997-06-30 2000-08-02 博普斯公司 Manifold array processor
CN103020890A (en) * 2012-12-17 2013-04-03 中国科学院半导体研究所 Visual processing device based on multi-layer parallel processing
US20200201612A1 (en) * 2015-04-23 2020-06-25 Google Llc Compiler for translating between a virtual image processor instruction set architecture (isa) and target hardware having a two-dimensional shift array structure
US20190304054A1 (en) * 2017-04-24 2019-10-03 Intel Corporation Compute optimization mechanism
CN110163349A (en) * 2018-02-12 2019-08-23 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
US20200201932A1 (en) * 2019-12-28 2020-06-25 Intel Corporation Apparatuses, methods, and systems for instructions of a matrix operations accelerator

Also Published As

Publication number Publication date
CN113867790A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
US10949496B2 (en) Dimension shuffling using matrix processors
CN109189473A (en) Processing with Neural Network device and its method for executing vector exchange instruction
CN109032670A (en) Processing with Neural Network device and its method for executing vector duplicate instructions
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN110059797B (en) Computing device and related product
CN111353598A (en) Neural network compression method, electronic device and computer readable medium
Huang et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN110059809B (en) Computing device and related product
WO2022001439A1 (en) Computing apparatus, integrated circuit chip, board and computing method
WO2022001457A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022001499A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022001454A1 (en) Integrated computing apparatus, integrated circuit chip, board card, and computing method
WO2022001498A1 (en) Computing apparatus, integrated circuit chip, board, electronic device and computing method
WO2022001456A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN114692844A (en) Data processing device, data processing method and related product
CN112766471A (en) Arithmetic device and related product
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN112394990A (en) Floating point to half precision floating point instruction processing device and method and related products
JP2023532573A (en) Computing Devices, Integrated Circuit Chips, Board Cards, Electronics and Computing Methods
CN111291884A (en) Neural network pruning method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21832835

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21832835

Country of ref document: EP

Kind code of ref document: A1