CN113867788A - Computing device, chip, board card, electronic equipment and computing method


Info

Publication number
CN113867788A
Authority
CN
China
Prior art keywords: data, processing, instructions, instruction, circuitry
Legal status: Pending
Application number
CN202010618089.3A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority application: CN202010618089.3A
PCT application: PCT/CN2021/095704 (published as WO2022001499A1)
Publication: CN113867788A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields

Abstract

The present disclosure provides a computing device, an integrated circuit chip, a board card, and a method of performing arithmetic operations using the computing device. The computing device may be included in a combined processing apparatus that may also include a general-purpose interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing apparatus may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme can improve the efficiency of operations in various data processing fields including, for example, the field of artificial intelligence, thereby reducing the overall overhead and cost of the operations.

Description

Computing device, chip, board card, electronic equipment and computing method
Technical Field
The present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic apparatus, and a computing method.
Background
In computing systems, an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a critical role in improving the performance of a computing chip (e.g., a processor) in the system. Various types of computing chips (particularly those in the field of artificial intelligence) currently utilize associated instruction sets to perform various general or specific control operations and data processing operations. However, current instruction sets suffer from a number of drawbacks. For example, existing instruction sets are constrained by their hardware architectures and offer poor flexibility. Further, many instructions can each complete only a single operation, so an operation involving multiple steps often requires multiple instructions, which can increase on-chip I/O data throughput. In addition, current instructions leave room for improvement in execution speed, execution efficiency, and chip power consumption.
In addition, the arithmetic instructions of a conventional CPU are designed to perform basic single-data scalar arithmetic operations, where a single-data operation refers to an instruction in which each operand is a scalar datum. However, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional vectors (i.e., tensor data), and hardware that supports only scalar operations cannot carry out such tasks efficiently. Therefore, how to execute multidimensional tensor operations efficiently is also an urgent problem to be solved in the computing field.
Disclosure of Invention
To address at least the problems with the prior art described above, the present disclosure provides a hardware architecture suitable for executing very long instruction word ("VLIW") instructions. By using this hardware architecture to execute improved VLIW instructions, aspects of the present disclosure can achieve technical advantages in a number of respects, including enhancing the processing performance of the hardware, reducing power consumption, increasing the execution efficiency of computing operations, and reducing computational overhead. Further, on the basis of the aforementioned hardware architecture, the disclosed solution supports efficient access and processing of tensor data, thereby accelerating tensor operations and reducing the computational overhead they incur when multidimensional vector operands are included in computation instructions.
In a first aspect, the present disclosure provides a computing device comprising a plurality of processing circuits and a control circuit, wherein: the control circuit is configured to fetch and parse a very long instruction word (VLIW) instruction and send the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction includes a descriptor indicating the shape of a tensor, and the control circuit is configured to determine, during the parsing, a storage address of the data corresponding to the operand from the descriptor; and
the plurality of processing circuits are connected into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and the one or more processing circuit arrays are configured to perform multi-threaded operations based on the parsed VLIW instruction and the storage address.
In a second aspect, the present disclosure provides an integrated circuit chip comprising a computing device as described above and as described in more detail in the following embodiments.
In a third aspect, the present disclosure provides a board card comprising an integrated circuit chip as described above and in detail in the embodiments below.
In a fourth aspect, the present disclosure provides an electronic device comprising an integrated circuit chip as described above and as detailed in various embodiments below.
In a fifth aspect, the present disclosure provides a method of performing a computing operation using a computing device as described above and in detail in the following embodiments, wherein the computing device comprises a control circuit and a plurality of processing circuits, the method comprising: fetching and parsing a very long instruction word (VLIW) instruction with the control circuit, and sending the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction comprises a descriptor indicating the shape of a tensor, and the parsing comprises determining, with the control circuit, a storage address of the data corresponding to the operand from the descriptor; and connecting the plurality of processing circuits into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration and performing multi-threaded operations using the one or more processing circuit arrays based on the parsed VLIW instruction and the storage address.
By using the computing device, the integrated circuit chip, the board card, the electronic equipment and the computing method provided by the disclosure, the processing circuits can be flexibly connected according to the received instruction, so that the VLIW instruction can be efficiently executed. Further, VLIW instructions improved based on the disclosed hardware architecture can be efficiently executed on the disclosed processing circuit array, thereby also improving the processing performance of the disclosed hardware architecture. In addition, based on the hardware architecture and flexible configuration and use of VLIW instructions of the present disclosure, execution efficiency of multi-threaded operations may be improved, thereby speeding up execution of computations.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1a is a block diagram illustrating a computing device according to one embodiment of the present disclosure;
FIG. 1b is a schematic diagram illustrating a data storage space according to one embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a computing device according to another embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;
FIG. 4 is an example block diagram illustrating an array of various types of processing circuits of a computing device in accordance with embodiments of the disclosure;
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connections of processing circuits according to embodiments of the present disclosure;
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connections of processing circuits according to embodiments of the present disclosure;
FIGS. 7a, 7b, 7c, and 7d are schematic diagrams illustrating various looping structures of processing circuits according to embodiments of the present disclosure;
FIGS. 8a, 8b, and 8c are schematic diagrams illustrating additional various looping structures of processing circuits in accordance with embodiments of the present disclosure;
FIGS. 9a, 9b, 9c, and 9d are schematic diagrams illustrating data stitching operations performed by pre-operation circuitry according to embodiments of the present disclosure;
FIGS. 10a, 10b, and 10c are schematic diagrams illustrating data compression operations performed by post-operation circuitry according to embodiments of the present disclosure;
FIG. 11 is a simplified flow diagram illustrating a method of performing an arithmetic operation using a computing device in accordance with an embodiment of the present disclosure;
FIG. 12 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure; and
FIG. 13 is a schematic diagram illustrating the structure of a board card according to an embodiment of the disclosure.
Detailed Description
Aspects of the present disclosure provide a hardware architecture that supports VLIW instruction execution. When the hardware architecture is implemented in a computing device, the computing device includes at least control circuitry and a plurality of processing circuits that are connected according to different configurations to form various array architectures that support VLIW instruction execution. Depending on implementation, the VLIW instructions of the present disclosure may be used in some scenarios in combination with separate configuration instructions and data read-write instructions, and in other scenarios may be combined with the aforementioned configuration instructions and data read-write instructions to form an extended VLIW instruction. By means of the hardware architecture and VLIW instructions of the present disclosure, computational operations and data reads can be performed efficiently, expanding the application scenarios of computations and reducing computational overhead.
In the context of the present disclosure, the VLIW instructions, configuration instructions, and data read-write instructions may be instructions in the instruction system of the software-hardware interface, and may be machine language in binary or other forms to be received and processed by hardware such as a computing device (or a processing circuit or processor). These instructions may include operation codes and operands for directing processor operations, and may comprise one or more operation codes depending on the application scenario. When a VLIW instruction, configuration instruction, or data read-write instruction includes a single operation code, that operation code may be used to instruct multiple operations of the computing device.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
FIG. 1a is a block diagram illustrating a computing device 100 according to one embodiment of the present disclosure. As shown in FIG. 1a, the computing device 100 includes a control circuit 102 and a plurality of processing circuits 104. In one embodiment, the control circuit is configured to fetch and parse VLIW instructions, operands of which comprise descriptors indicating the shape of a tensor, and to send the parsed VLIW instructions to the plurality of processing circuits 104; during the parsing, the control circuit may be configured to determine from the descriptors a storage address of the data corresponding to the operands. In another embodiment, the plurality of processing circuits are connected into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and the one or more processing circuit arrays are configured to perform multi-threaded operations based on the parsed VLIW instructions and the storage address. In the present disclosure, the parsed VLIW instruction may include at least one of an arithmetic instruction, a pre-processing instruction, a post-processing instruction, and a move instruction, each of which may be a microinstruction or a control signal executed inside the computing device (or a processing circuit or processor).
According to different application scenarios, the connection may be a hardware-based configuration connection (or "hard connection") between the plurality of processing circuits, or a logical connection (or "soft connection") established on top of a specific hardware connection through the configuration instructions described later. To implement the aforementioned configured connection of one or more processing circuits, the control circuit of the present disclosure may also obtain configuration instructions, and the plurality of processing circuits are connected according to the received configuration instructions to form the one or more processing circuit arrays. In one embodiment, a processing circuit array may form a closed loop in at least one of the one or more dimensional directions according to the configuration instructions, i.e., a "looped structure" in the context of the present disclosure.
In one embodiment, the control circuit is configured to send at least one of a constant term and a table entry to the processing circuit array in accordance with the configuration instruction in order to perform the multi-threaded operation. In one application scenario, the constant term and the table entry may be stored in a register of the control circuit, and the control circuit reads them from the register according to the configuration instruction. In another application scenario, the constant term and table entry may be stored in an on-chip storage circuit (such as the storage circuit shown in FIG. 2 or 3) or in an off-chip storage circuit. In this case, the storage addresses of the constant term and the table entry may be included in the configuration instruction, and the processing circuit array may obtain the constant term and/or the table entry required for calculation from the corresponding on-chip or off-chip storage circuit according to those storage addresses.
In one embodiment, the control circuit may comprise one or more registers storing configuration information about the processing circuit array, and the control circuit is configured to read the configuration information from the registers according to the configuration instructions and send it to the processing circuits so that they connect according to the configuration information. In one application scenario, the configuration information may include preset location information of the processing circuits constituting the one or more processing circuit arrays, such as coordinate information or label information of the processing circuits. When the processing circuit array is configured to form a closed loop, the configuration information may further include looping configuration information about the closed loop. Alternatively, in one embodiment, the configuration information may be carried directly by the configuration instruction instead of being read from a register. In this case, the processing circuits may be configured directly according to the position information in the received configuration instruction to form an array without a closed loop, or to further form an array with a closed loop together with other processing circuits.
When the connections are configured to form a two-dimensional array, whether by configuration instructions or by configuration information obtained from registers, the processing circuits located in the two-dimensional array are configured to connect, in at least one of their row, column, or diagonal directions, with the remaining one or more processing circuits in the same row, column, or diagonal in a predetermined two-dimensional spacing pattern so as to form one or more closed loops. Here, the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
Further, when the connection is configured to form a three-dimensional array in accordance with the aforementioned configuration instruction or configuration information, the processing circuit arrays are connected in a loop of a three-dimensional array constituted by a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein: the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops. Here, the predetermined three-dimensional spacing pattern is associated with the number of spaces and the number of layers of spaces between the processing circuits to be connected.
In one embodiment, the VLIW instructions of the present disclosure include one or more arithmetic instructions, and the aforementioned one or more processing circuit arrays may be configured to perform multithreaded arithmetic operations according to the arithmetic instructions. The one or more operational instructions may be microinstructions or control signals that are executed within a computing device (or processing circuit, processor), which may include (or indicate) one or more operations that are to be performed by the computing device. The arithmetic operation may include various operations such as an addition operation, a multiplication operation, a convolution operation, a pooling operation, etc., and the present disclosure does not limit the specific type of the arithmetic operation.
In one application scenario, the plurality of processing circuit arrays may be configured to each execute a different operational instruction. In another application scenario, at least two of the plurality of processing circuit arrays may be configured to execute the same operational instruction. In one embodiment, the VLIW instruction may also include a move instruction. The processing circuit array may be configured to perform move operations on data between processing circuits according to the move instruction. In one application scenario, the move instruction may further comprise a mask instruction, such that the processing circuit array may be configured to selectively move data according to the mask instruction, e.g. to move unmasked data without moving masked data. In one application scenario, the move instruction may further comprise register identification information for indicating a source register and a destination register for moving data between the processing circuits, such that the processing circuits may be configured to move data from said source register to said destination register in dependence on said register identification information.
As previously described, the multi-threaded operations of the present disclosure further involve using descriptors to obtain information about tensor shapes in order to determine the storage addresses of tensor data, so that the tensor data can be fetched and stored via those storage addresses.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, N being a positive integer (e.g., N = 1, 2, or 3) or zero. A tensor can have various forms of data composition and various dimensions: for example, a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a 2-dimensional tensor, while tensors of more than 2 dimensions are also possible. The shape of a tensor includes information such as the dimensions of the tensor and the size of each dimension. For example, for the tensor:
    T = [ t11  t12  t13  t14
          t21  t22  t23  t24 ]
the shape of the tensor can be described by a descriptor as (2, 4), i.e. the tensor is represented by two parameters as a two-dimensional tensor, with the size of the first dimension (column) of the tensor being 2 and the size of the second dimension (row) being 4. It should be noted that the manner in which the descriptors indicate the tensor shape is not limited in the present application.
In one possible implementation, the value of N may be determined according to the dimension (order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier is used to distinguish descriptors from one another; for example, it may be a number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of the descriptor may include a shape parameter representing the remaining dimension of the tensor data.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier and content of a descriptor and the tensor data indicated by the descriptor may be stored in the same block of internal memory. For example, a contiguous block of on-chip cache with addresses ADDR0-ADDR1023 may be used to store the relevant content of the descriptors: addresses ADDR0-ADDR63 may serve as descriptor storage space storing the identifiers and contents of the descriptors, and addresses ADDR64-ADDR1023 may serve as data storage space storing the tensor data indicated by the descriptors. Within the descriptor storage space, the identifiers of the descriptors may be stored at addresses ADDR0-ADDR31, and the contents of the descriptors at addresses ADDR32-ADDR63. It should be understood that ADDR here does not denote one bit or one byte; it is used as an address unit, with one ADDR representing one address. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited in this respect.
In one possible implementation, the identity of the descriptors, the content, and the tensor data indicated by the descriptors may be stored in different areas of internal memory. For example, a register may be used as a descriptor storage space, the identifier and the content of the descriptor may be stored in the register, an on-chip cache may be used as a data storage space, and tensor data indicated by the descriptor may be stored.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area in the buffer space can be allocated for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identifier and content of the descriptors may be stored in an internal memory, and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identifiers and contents of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored off-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be allocated for each item of tensor data, with the start address of each space corresponding one-to-one to a descriptor. In this case, the control circuit may determine, based on the descriptor, the data address in the data storage space of the data corresponding to the operand.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may further be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter indicating the address of the tensor data. For example, if the tensor data is 3-dimensional, when the descriptor points to the address of the tensor data its content may include a single address parameter indicating that address, such as the starting physical address of the tensor data, or it may include multiple address parameters, such as the starting address of the tensor data plus an address offset, or address parameters for each dimension of the tensor data. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a variation of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the reference address may include the start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of that data in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one of N dimensional directions, the size of the storage area in at least one of N dimensional directions, the offset of the storage area in at least one of N dimensional directions, the positions of at least two vertices located at diagonal positions in the N dimensional directions relative to the data reference point, and the mapping relationship between the data description positions of tensor data indicated by the descriptors and the data addresses. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
In one possible implementation, the content of the descriptor of the tensor data may be determined according to a reference address of a data reference point of the descriptor in a data storage space of the tensor data, a size of the data storage space in at least one of N dimensional directions, a size of the storage area in at least one of the N dimensional directions, and/or an offset of the storage area in at least one of the N dimensional directions.
FIG. 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1b, the data storage space 21 stores two-dimensional data in a row-major manner, addressable by coordinates (x, y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (the reference address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the start address PA_start of the data storage space 21. The content of the descriptor of the data block 23 may then be determined from the size ori_x of the data storage space 21 along the X axis, the size ori_y along the Y axis, and the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following formula (1):
    D = { ori_x, ori_y,
          offset_x, offset_y,        (1)
          size_x, size_y }
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation manner, a reference address of a data reference point of the descriptor in the data storage space may be defined, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions of at least two vertexes located at diagonal positions in the N-dimensional directions relative to the data reference point.
For example, a reference address PA _ base of a data reference point of the descriptor in the data storage space may be agreed. For example, one data (for example, data with position (2, 2)) may be selected as a data reference point in the data storage space 21, and the physical address of the data in the data storage space may be used as the reference address PA _ base. The content of the descriptor of the data block 23 in fig. 1b can be determined from the position of the two vertices of the diagonal position relative to the data reference point. First, the positions of at least two vertices of the diagonal positions of the data block 23 relative to the data reference point are determined, for example, the positions of the diagonal position vertices relative to the data reference point in the top-left-to-bottom-right direction are used, wherein the relative position of the top-left vertex is (x _ min, y _ min), and the relative position of the bottom-right vertex is (x _ max, y _ max), and then the content of the descriptor of the data block 23 can be determined according to the reference address PA _ base, the relative position of the top-left vertex (x _ min, y _ min), and the relative position of the bottom-right vertex (x _ max, y _ max).
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following formula (2):
    D = { PA_base,
          (x_min, y_min),            (2)
          (x_max, y_max) }
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation manner, the content of the descriptor of the tensor data can be determined according to a reference address of the data reference point of the descriptor in the data storage space and a mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following formula (3):
    D = { PA_base, f(x, y, z) }        (3)
in one possible implementation, the descriptor is further configured to indicate an address of the N-dimensional tensor data, where the content of the descriptor further includes at least one address parameter indicating the address of the tensor data, for example, the content of the descriptor may be:
    D = { PA,
          ori_x, ori_y,
          offset_x, offset_y,
          size_x, size_y }
where PA is the address parameter. The address parameter may be a logical address or a physical address. The descriptor parsing circuit may obtain the corresponding data address by taking PA as any one of a vertex, a middle point, or a preset point of the tensor shape, in combination with the shape parameters in the X and Y directions.
In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address includes a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be:
    D = { PA_start,
          ori_x, ori_y,
          offset_x, offset_y,
          size_x, size_y }
wherein PA _ start is a reference address parameter, which is not described again.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In a possible implementation, a default reference address can be set for a task; this reference address is used by the descriptors in the instructions of that task, and the descriptor contents may then include shape parameters based on this reference address. The reference address may be determined by setting an environment parameter for the task. The relevant description and usage of the reference address can be found in the above embodiments. In this implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with a mode of setting a common reference address by using environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x * size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (4):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x        (4)

Combining the data start address PA1(x,y) determined according to formula (4) with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage area, the storage area in the data storage space of the tensor data indicated by the descriptor can be determined.
In a possible implementation, when the operand further includes a data description position for the descriptor, the data address in the data storage space of the data corresponding to the operand may be determined from the content of the descriptor and that data description position. In this way, a portion of the data (e.g., one or more data elements) in the tensor data indicated by the descriptor can be processed.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x * size_y, and the data description position for the descriptor included in the operand is (xq, yq), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (5):

PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)        (5)
the computing device of the present disclosure is described above with reference to fig. 1a and 1b, and by using one or more processing circuit arrays in the computing device and based on the operation functions of the processing circuits, the VLIW instructions of the present disclosure are efficiently executed on the computing device to complete multi-thread operations, thereby improving the execution efficiency of parallel operations and reducing the computation overhead. In addition, by using the VLIW instruction to perform the operations for the tensor, the disclosed scheme also significantly improves the access and processing efficiency of tensor data, and reduces the overhead for tensor operations.
FIG. 2 is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen, the computing device 200 in fig. 2 includes a memory circuit 106 in addition to the control circuit 102 and the plurality of processing circuits 104 that are the same as the computing device 100. In one embodiment, the control circuitry may be further configured to obtain data read and write instructions and to send the data read and write instructions to the storage circuitry such that the storage circuitry performs read and write operations of data associated with the multi-threaded operations in accordance with the data read and write instructions.
In an application scenario, the storage circuit may be configured with interfaces for data transmission in multiple directions so as to connect with the processing circuits 104, allowing the data to be operated on by the processing circuits, the intermediate results obtained during operations, and the final results obtained after operations to be stored accordingly. In view of the foregoing, in one application scenario, the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store the data used by the processing circuit array for performing operations as well as the operation results, and the main cache module is configured to cache intermediate results produced during operations in the processing circuit array. Further, the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data can be transferred between the on-chip system and off-chip memory.
FIG. 3 is a block diagram illustrating a computing device 300 according to yet another embodiment of the present disclosure. As can be seen, in addition to including the same control circuitry 102, plurality of processing circuitry 104, and storage circuitry 106 as the computing device 200, the computing device 300 in fig. 3 also includes data manipulation circuitry 108, which includes pre-manipulation circuitry 110 and post-manipulation circuitry 112. Based on such a hardware architecture, the VLIW instructions of the present disclosure may include pre-processing instructions and/or post-processing instructions, wherein the pre-operation circuitry may be configured to perform pre-processing operations on input data of the multi-threaded operations according to the pre-processing instructions, and the post-operation circuitry may be configured to perform post-processing operations on output data of the multi-threaded operations according to the post-processing instructions.
In an application scenario, the pre-operation circuit may split the operation data according to the type of the data and the logical address of each processing circuit, and transmit the resulting sub-data to the corresponding processing circuits in the array for operation. In another application scenario, the pre-operation circuit may select one of multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two input data streams. In one application scenario, the post-operation circuit may be configured to perform compression operations on data, including filtering the data with a mask or by comparing data values against a given threshold, thereby achieving compression of the data.
FIG. 4 is an example block diagram illustrating an array of various types of processing circuitry of a computing device 400 according to an embodiment of this disclosure. As can be seen from the figure, the computing apparatus 400 shown in fig. 4 has a similar architecture to the computing apparatus 300 shown in fig. 3, so that the description of the computing apparatus 300 in fig. 3 also applies to the same details shown in fig. 4, and therefore the description thereof is omitted.
As can be seen in FIG. 4, the plurality of processing circuits may include, for example, a plurality of first-type processing circuits 104-1 and a plurality of second-type processing circuits 104-2 (distinguished by different background colors in the figure). The plurality of processing circuits may be arranged by physical connections to form a two-dimensional array. For example, as shown in the figure, the two-dimensional array contains M rows and N columns (denoted M × N) of first-type processing circuits, where M and N are positive integers greater than 0. The first-type processing circuits may be used to perform arithmetic and logical operations, including, for example, linear operations such as addition, subtraction, and multiplication, comparison operations, and non-linear operations, or any combination of the foregoing. Further, there are two columns of second-type processing circuits on each of the left and right sides of the periphery of the M × N first-type processing circuit array (M × 2 + M × 2 circuits) and two rows on the lower side of the periphery (N × 2 + 8 circuits), so that the processing circuit array has (M × 2 + M × 2 + N × 2 + 8) second-type processing circuits in total. In one embodiment, the second-type processing circuits may be adapted to perform non-linear operations on received data, such as comparison operations, table lookup operations, or shift operations.
In some application scenarios, the memory circuits employed by both the first type of processing circuit and the second type of processing circuit may have different memory sizes and memory manners. For example, a predicate storage circuit in a first type of processing circuit may store predicate information using a plurality of numbered registers. Further, the first type of processing circuit may access predicate information in a register of a corresponding number according to a register number specified in the received parsed instruction. As another example, the second type of processing circuit may store the predicate information in a static random access memory ("SRAM"). Specifically, the second-type processing circuit may determine a storage address of the predicate information in the SRAM according to an offset of a location where the predicate information is specified in the received parsed instruction, and may perform a predetermined read or write operation on the predicate information in the storage address.
The basic composition and extended architecture of the computing device of the present disclosure are described in detail above with reference to FIGS. 1a and 1b through FIG. 4. The configuration instructions used to configure the connections of the processing circuits, the data read-write instructions used to perform data read and write operations, and the VLIW instructions used to perform various computing operations, all mentioned above, are described in detail below.
Configuration instructions
As previously described, the configuration instructions of the present disclosure may be used to configure the processing circuits for executing subsequent data read-write instructions and VLIW instructions. In an exemplary implementation, a configuration instruction may include a plurality of instruction fields for configuring the processing circuit array. For example, the instruction fields may indicate attributes of the plurality of processing circuits connected in a two-dimensional matrix structure, such as the looping pattern and data type of the processing circuits in the horizontal or vertical direction, the registers holding the coordinates of the processing circuits in the horizontal or vertical direction, information about constant terms and table entries and their memory addresses, and an instruction field for predicate operations. For example, the looping pattern may connect 4, 8, or 16 processing circuits in a loop, or 16 processing circuits of the first type together with 4 processing circuits of the second type, or form no loop at all. Different looping patterns affect how data flows through the processing array. The specific execution of configuration instructions by the control circuit is described below by way of example.
During execution of a configuration instruction, the control circuit may first compare the value of the instruction field for the predicate operation against its internal predicate register, thereby determining whether to carry out the current configuration of the processing circuits. When it determines that the configuration is to be performed, the control circuit may read, from an internal register, the coordinates of the processing circuits in the horizontal direction, thereby acquiring the horizontal coordinates used to configure the processing circuit array. Further, the control circuit may acquire the horizontal looping mode information directly from an immediate field. Similar operations apply to the coordinates and looping mode information of the processing circuits in the vertical direction.
Then, the control circuit may send the coordinate information and the looping mode information together to the processing circuit array, and each processing circuit in the array may configure its internal registers according to the coordinate information; for example, the coordinate information may be written into, and modify the values of, the horizontal and vertical looping configuration registers. Here, the values of the horizontal and vertical looping configuration registers may be used to determine the direction of data flow in the current processing circuit array, and thus the manner in which the processing circuits in the array are looped.
For configuring a constant term, if the relevant instruction field indicates that a constant term needs to be configured, the control circuit may fetch the constant value either from a register or directly from an immediate field, depending on the source of the constant. For example, when the configuration instruction field indicates that the constant term comes from a register, the control circuit may obtain the constant value by reading the register with the specified number. Thereafter, the control circuit may transmit the acquired constant value to the processing circuit array.
For configuring table entries, in some scenarios the size of a table entry may exceed the bit width of the instruction, making it difficult to write the entire entry directly in the instruction, so the entry contents are often pre-stored in the storage circuit. If the relevant instruction field indicates that a table entry needs to be configured, the control circuit may request the storage circuit to read the table entry at the configured memory address, where the address parameter may come from a register identified in the configuration instruction. When the storage circuit (e.g., the main storage module) receives the request, it may return the entry data to the processing circuit array. In one application scenario, after obtaining the entry data, the processing circuit array may save the configured entry to an internal storage circuit (or register). In one embodiment, the entire processing circuit array may share a single stored copy of the constant term and table entry data.
Data read-write instructions
As described above, the control circuit of the present disclosure may send the parsed data read-write instruction (including a read request instruction and/or a write request instruction) to the storage circuit, so that the storage circuit exchanges data with the processing circuit array according to the data read-write instruction. In one application scenario, the storage circuit may include input (or write) ports and output (or read) ports in multiple directions for connection with one or more processing circuits in the processing circuit array. Accordingly, the instruction fields of a data read-write instruction may include read request information for one or more specific read ports and write request information for one or more specific write ports.
In one embodiment, the read request information at least includes address information and data amount information of data to be read, and the write request information at least includes address information and data amount information of data to be written. For example, for multidimensional data, the data volume information may include a data volume size of a first dimension of the request, an address span size of a second dimension, a number of iterations of the second dimension, an address span size of a third dimension, a number of iterations, and so on.
During execution of a data read-write instruction, the control circuit obtains multiple items of read/write request information after parsing the instruction. It may then determine, according to predicate logic, whether each request needs to be executed. For a read request satisfying the current execution condition, the control circuit sends the read request information to the storage circuit. Thereafter, the storage circuit fetches the multidimensional (e.g., three-dimensional) data from the corresponding memory address according to the read request information and returns the data to the processing circuit array. In contrast, for a write request, the control circuit may send the write request information to the processing circuit array to control how data is output from the array during subsequent execution of VLIW instructions. The processing circuit array may then output data to the storage circuit during execution of subsequent instructions in accordance with that write request information, and the storage circuit writes the received data back into its local storage space.
VLIW instruction
The VLIW instructions of the present disclosure may include one or more operational instructions, and the processing circuit arrays may be configured to perform multi-threaded operations according to the operational instructions. In one embodiment, the plurality of processing circuit arrays may each be configured to execute a different operational instruction. In another embodiment, at least two of the plurality of processing circuit arrays may be configured to execute the same operational instruction.
In one application scenario, a VLIW instruction of the present disclosure may include an instruction field for indicating operations of the plurality of input ports and output ports of a processing circuit array, an instruction field for moving data of the processing circuit array in the horizontal and/or vertical direction, an instruction field for indicating a specific operation performed inside the processing circuit array, and the like.
For example, the number 0 input port operation of a processing circuit array may represent a pre-processing operation performed by read port number 0 on the input data. The pre-processing operations here may include operations such as splicing, table lookup and data type conversion, which may be performed by the pre-operation circuit 110 of fig. 2 or 3. In addition, by specifying the destination of the input data (e.g., via the move operation described above), the input data can be sent directly to a processing circuit to perform subsequent operations, or used to modify the values of the internal registers of the current processing circuit. In one application scenario, the internal registers may include internal general purpose registers and special purpose registers such as predicate registers. Further, the instruction field of the input port operation may also include a field indicating predicate information, so that each processing circuit in the processing circuit array may compare the predicate information with its internal predicate register to determine whether to perform the current input port operation.
The instruction field for moving data of the processing circuit array in the horizontal and/or vertical direction specifies the operation information for data movement within the processing circuits. This information may include, for example: mask information for masking the movement of portions of the data; the identification of the source register whose contents are transmitted to a neighboring processing circuit; the identification of the destination register in the neighboring processing circuit; a looping register identification for selecting different registers for the data flow; and predicate information used by the control circuit and/or inside each processing circuit, whose predicate logic decides whether the current instruction field is executed.
During execution of a VLIW instruction, an instruction field comprising the above information may be issued to each processing circuit in each processing circuit array. After receiving the information, each processing circuit may determine whether it performs the data move operation by comparing its current predicate register contents with the predicate information in the instruction field. If the data move operation is performed, the source processing circuit reads data from the designated local source register, performs masking based on the aforementioned mask information, and derives the location of the target processing circuit in the given direction of movement from the information in the designated looping register. The source processing circuit then sends the masked data to the destination register of the specified number in the target processing circuit. This one-time data transfer process may occur in each processing circuit.
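A minimal sketch of how such a move field and its predicated, masked ring transfer could be modeled is given below; the field layout, register model and predicate comparison are illustrative assumptions, not the patent's encoding.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical model of the data-move instruction field and its predicated,
# masked execution in a ring of processing circuits.

@dataclass
class MoveField:
    mask: int        # masks off portions of the moved data
    src_reg: int     # source register in the sending circuit
    dst_reg: int     # destination register in the receiving circuit
    loop_reg: int    # looping register selecting the move offset
    predicate: int   # compared with each circuit's predicate register

@dataclass
class Circuit:
    regs: List[int]        # general purpose registers
    pred_reg: int          # predicate register
    loop_regs: List[int]   # looping registers

def execute_move(circuits: List[Circuit], f: MoveField) -> None:
    n = len(circuits)
    arriving = {}
    for i, c in enumerate(circuits):
        if c.pred_reg != f.predicate:      # predicate gate: this circuit skips
            continue
        data = c.regs[f.src_reg] & f.mask  # mask part of the moved data
        offset = c.loop_regs[f.loop_reg]   # move distance from the looping register
        arriving[(i + offset) % n] = data  # target circuit along the ring
    for i, v in arriving.items():          # all transfers land simultaneously
        circuits[i].regs[f.dst_reg] = v
```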
The instruction field indicating the specific operation executed inside the processing circuit may include various operation-related information, such as operand source information, operand register location information, destination register information for storing the operation result, description information of the operation, data type information for the operation process, and predicate information used by the control circuit and the processing circuits to perform predicate evaluation.
During execution, each processing circuit may perform a predicate judgment based on the predicate information and its internal predicate register to determine whether it performs the operation. If the operation is performed, the processing circuit may read its internal registers according to the operand register information to obtain the operands of the operation, and then determine the type of operation from the aforementioned description information. After the operation is finished, the processing circuit may write the result back to an internal register according to the destination register information for the operation result.
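The following sketch illustrates this predicated fetch-execute-write-back flow under an assumed register model; the operation table and the field dictionary are illustrative assumptions.

```python
# Sketch of the predicated execution flow described above; the operation
# table and register model are assumptions for illustration only.

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def execute_op(circuit_regs, pred_reg, field):
    """field: dict with 'predicate', 'src0', 'src1', 'dst', 'op' keys."""
    if pred_reg != field["predicate"]:
        return                              # predicate says: skip this circuit
    a = circuit_regs[field["src0"]]         # operand fetch from internal regs
    b = circuit_regs[field["src1"]]
    result = OPS[field["op"]](a, b)         # op chosen by the description info
    circuit_regs[field["dst"]] = result     # write back to the destination reg
```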
The use of the VLIW instruction of the present disclosure to perform the tensor operation described above in connection with fig. 1b will now be illustrated, taking as an example the case where the source operands are two tensors and an addition of the two is performed.
First, the control circuit of the present disclosure may obtain the storage addresses of the data corresponding to the operands from the parsed instruction; for example, the starting addresses src1_addr, src2_addr and dst_addr of the tensors src1 and src2 and the destination tensor dst (i.e., the result tensor) may be calculated respectively.
Then, the control circuit may transmit the address information of the source tensors and the destination tensor to the storage circuit, and transmit information including the operation type and the operation data type to the processing circuits that perform the tensor operation. In response to receiving this information, the storage circuit reads the tensors src1 and src2 starting from their start addresses and transmits them to the respective processing circuits. The processing circuits may then perform the addition of the two tensors src1 and src2 and, after obtaining the result tensor dst, send it to the storage circuit. Finally, the storage circuit may write the result tensor dst back to the destination memory space, i.e., store the result data starting at its start address.
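As a concrete illustration of this flow, the sketch below models storage as a flat list and performs the read, add and write-back steps described above; the Python interfaces are assumptions for exposition, not the hardware behavior.

```python
# End-to-end sketch of the tensor-add example: src1_addr, src2_addr,
# dst_addr and the length are values the control circuit would derive
# from the parsed instruction.

def tensor_add(storage, src1_addr, src2_addr, dst_addr, length):
    src1 = storage[src1_addr:src1_addr + length]   # storage circuit reads src1
    src2 = storage[src2_addr:src2_addr + length]   # ... and src2
    dst = [a + b for a, b in zip(src1, src2)]      # processing circuits add
    storage[dst_addr:dst_addr + length] = dst      # result written back to dst

memory = list(range(16)) + [0] * 8
tensor_add(memory, 0, 8, 16, 8)
assert memory[16:24] == [i + (i + 8) for i in range(8)]
```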
From the above description of the configuration instructions, data read-write instructions and VLIW instructions, it can be seen that each of these instructions includes a corresponding predicate, and the control circuit, the processing circuits and the storage circuit are configured to determine, according to the corresponding predicate, whether to execute the VLIW instruction, the configuration instruction and/or the data read-write instruction. Further, the VLIW instruction of the present disclosure may be combined with at least one of the configuration instruction and the data read-write instruction to form an extended VLIW instruction according to the application scenario. The instruction stream can thereby be further simplified and the efficiency of instruction execution improved.
Fig. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of processing circuits according to embodiments of the present disclosure. As previously mentioned, the processing circuits of the present disclosure may be connected in a hard-wired manner, or logically connected according to configuration instructions, thereby forming the topology of a connected one- or multi-dimensional array. When a plurality of processing circuits are connected in a multi-dimensional array, the multi-dimensional array may be a two-dimensional array, and a processing circuit located in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with one or more of the remaining processing circuits in the same row, the same column or the same diagonal, in a predetermined two-dimensional spacing pattern. The predetermined two-dimensional spacing pattern may be characterized by the number of processing circuits spaced over in each connection. Fig. 5a to 5c illustrate various forms of two-dimensional array topologies between a plurality of processing circuits.
As shown in fig. 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, with one processing circuit as the center of the two-dimensional array, one processing circuit is connected in each of the four directions, horizontal and vertical, relative to it, thereby forming a two-dimensional array spanning three rows and three columns. Since the processing circuit at the center of the array is directly connected to the adjacent processing circuits of the previous and next columns in the same row and to the adjacent processing circuits of the previous and next rows in the same column, the number of processing circuits spaced between connected circuits (referred to simply as the "interval number") is 0.
As shown in fig. 5b, four rows and four columns of processing circuits can be connected to form a two-dimensional Torus array, in which each processing circuit is connected to the adjacent processing circuits of the previous and next rows and the previous and next columns, i.e., the interval number between adjacently connected processing circuits is 0. Further, the first processing circuit in each row or column of the two-dimensional Torus array is also connected to the last processing circuit in that row or column, with an interval number of 2 between these head-to-tail connected processing circuits.
As shown in fig. 5c, four rows and four columns of processing circuits may be connected to form a two-dimensional array with an interval number of 0 between adjacent processing circuits and an interval number of 1 between non-adjacent processing circuits. Specifically, processing circuits adjacent to each other in the same row or column of the two-dimensional array are directly connected (interval number 0), while processing circuits not adjacent to each other in the same row or column are connected with an interval number of 1. It can be seen that when a plurality of processing circuits are connected to form a two-dimensional array, different interval numbers may coexist between processing circuits in the same row or column, as shown in fig. 5b and 5c. Similarly, in some scenarios, processing circuits may be connected in the diagonal direction with different interval numbers.
As shown in fig. 5d, four two-dimensional Torus arrays as shown in fig. 5b may be arranged at predetermined intervals and connected to form a three-dimensional Torus array. On the basis of the two-dimensional Torus arrays, the three-dimensional Torus array is connected between layers using a spacing scheme similar to that used between rows and columns. For example, the processing circuits of adjacent layers in the same row and column are directly connected (interval number 0), and the processing circuits of the first and last layers in the same row and column are connected (interval number 2). A three-dimensional Torus array of four layers, four rows and four columns is thus finally formed.
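The wrap-around connectivity of such Torus arrays can be illustrated with a small helper that, given an interval number, returns the positions a processing circuit connects to along each dimension; the function below is an illustrative model only, not the patent's wiring.

```python
# Sketch of the connection targets of a processing circuit in a Torus
# array: with interval number k, a circuit connects to the circuit
# k + 1 positions away, wrapping at the edges (the head-to-tail links
# of fig. 5b). Dimensions and interval numbers are parameters.

def torus_neighbors(pos, shape, interval=0):
    """pos/shape: tuples like (row, col) or (layer, row, col)."""
    step = interval + 1                    # interval 0 -> adjacent circuits
    neighbors = []
    for axis in range(len(shape)):
        for direction in (-step, step):
            n = list(pos)
            n[axis] = (pos[axis] + direction) % shape[axis]
            neighbors.append(tuple(n))
    return neighbors

# 4x4 two-dimensional Torus (fig. 5b): circuit (0, 0) reaches the last
# circuit of its row and column through the wrap-around connection.
print(torus_neighbors((0, 0), (4, 4)))   # [(3,0), (1,0), (0,3), (0,1)]
```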
From the above examples, those skilled in the art will appreciate that connection relationships of other multi-dimensional arrays of processing circuits may be formed by adding new dimensions and increasing the number of processing circuits on the basis of a two-dimensional array. In some application scenarios, aspects of the present disclosure may also configure logical connections between processing circuits through configuration instructions. In other words, although hard-wired connections may exist between processing circuits, aspects of the present disclosure may selectively connect some processing circuits, or selectively bypass others, through configuration instructions to form one or more logical connection relationships. In some embodiments, the aforementioned logical connections may also be adjusted according to the requirements of the actual operation (e.g., conversion of data types). Further, aspects of the present disclosure may configure the connections of the processing circuits for different computational scenarios, including, for example, into a matrix or into one or more closed computational loops.
Fig. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of processing circuits according to embodiments of the present disclosure. Figs. 6a to 6d show yet further exemplary connection relationships for the multi-dimensional arrays formed by the plurality of processing circuits shown in figs. 5a to 5d. In view of this, the technical details described in connection with figs. 5a to 5d also apply to what is shown in figs. 6a to 6d.
As shown in fig. 6a, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array and, in each of the four directions along its row and column, three further processing circuits connected to it. The interval numbers of the connections between the central processing circuit and the remaining processing circuits are therefore 0, 1 and 2, respectively. As shown in fig. 6b, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array, processing circuits connected to it in the two opposite directions of the same row, and one processing circuit connected to it in each of the two opposite directions of the same column. The interval numbers of the connections between the central processing circuit and the processing circuits in the same row are therefore 0 and 2, respectively, while the interval number of the connections between the central processing circuit and the processing circuits in the same column is 0.
As previously illustrated in connection with fig. 5d, the multi-dimensional array formed by the plurality of processing circuits may be a three-dimensional array made up of a plurality of layers. Each layer of the three-dimensional array may comprise a two-dimensional array of processing circuits arranged along its row and column directions. A processing circuit located in the three-dimensional array may be connected, in at least one of its row, column, diagonal and layer directions, with one or more of the remaining processing circuits in the same row, the same column, the same diagonal or a different layer, in a predetermined three-dimensional spacing pattern. The predetermined three-dimensional spacing pattern and the number of processing circuits spaced in each connection may further be related to the number of layers spaced. The connections of the three-dimensional array are further described below with reference to figs. 6c and 6d.
Fig. 6c shows a three-dimensional array of multiple layers, rows and columns formed by connecting a plurality of processing circuits. Take the processing circuit located at the l-th layer, r-th row and c-th column (denoted (l, r, c)) as an example: located at the center of the array, it is connected respectively to the processing circuits of the previous column (l, r, c-1) and the next column (l, r, c+1) in the same row of the same layer, to the processing circuits of the previous row (l, r-1, c) and the next row (l, r+1, c) in the same column of the same layer, and to the processing circuits of the previous layer (l-1, r, c) and the next layer (l+1, r, c) in the same row and column of the adjacent layers. The interval numbers of the connections between the processing circuit at (l, r, c) and the other processing circuits in the row, column and layer directions are all 0.
Fig. 6d shows a three-dimensional array in which the interval numbers of the connections between processing circuits in the row, column and layer directions are all 1. Taking the processing circuit at the central position (l, r, c) of the array as an example, it is connected, in the same row of the same layer, to the processing circuits at (l, r, c-2) and (l, r, c+2) two columns before and after it, and, in the same column of the same layer, to the processing circuits at (l, r-2, c) and (l, r+2, c) two rows before and after it. It is further connected to the processing circuits at (l-2, r, c) and (l+2, r, c) two layers before and after it in the same row and column. Similarly, among the remaining processing circuits of the same layer, those spaced one column apart are connected to each other, i.e., the circuits at (l, r, c-3) and (l, r, c-1), and the circuits at (l, r, c+1) and (l, r, c+3). Likewise, the processing circuits at (l, r-3, c) and (l, r-1, c) in the same column of the same layer, spaced one row apart, are connected to each other, as are the circuits at (l, r+1, c) and (l, r+3, c). In addition, the processing circuits at (l-3, r, c) and (l-1, r, c) in the same row and column, spaced one layer apart, are connected to each other, as are the circuits at (l+1, r, c) and (l+3, r, c).
The connection relationships of the multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above; the different loop structures formed by a plurality of processing circuits are further described below with reference to figs. 7 and 8.
Fig. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure. Depending on the application scenario, the processing circuits may be connected not only by physical connection relationships, but also by logical connection relationships configured according to the received parsed instructions. A plurality of processing circuits may be configured, using such logical connection relationships, to form a closed loop.
As shown in fig. 7a, four adjacent processing circuits are numbered sequentially 0, 1, 2 and 3. The four processing circuits are connected in sequence in the clockwise direction starting from processing circuit 0, and processing circuit 3 is connected back to processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (referred to simply as a "loop"). In this loop, the interval number between connected processing circuits is 0 or 2; e.g., the interval number between processing circuits 0 and 1 is 0, while that between processing circuits 3 and 0 is 2. Further, the physical addresses (which may also be referred to as physical coordinates in the context of the present disclosure) of the four processing circuits in the illustrated loop may be represented as 0-1-2-3, and their logical addresses (which may also be referred to as logical coordinates) may likewise be represented as 0-1-2-3. It should be noted that the connection order shown in fig. 7a is only exemplary and not limiting; those skilled in the art may also connect the four processing circuits in series in the counterclockwise direction to form a closed loop, according to the actual calculation requirements.
In some practical scenarios, when the data bit width supported by one processing circuit cannot meet the bit width requirement of the operation data, a plurality of processing circuits may be combined into one processing circuit group to jointly represent one data item. For example, assume that one processing circuit can process 8-bit data. When 32-bit data needs to be processed, four processing circuits may be combined into one processing circuit group, so that four 8-bit segments are connected to form one 32-bit data item, as sketched below. Further, one processing circuit group formed of the aforementioned four 8-bit processing circuits can serve as one processing circuit 104 shown in fig. 7b, so that operations of higher bit width can be supported.
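A minimal sketch of this lane grouping, assuming lane 0 holds the least significant byte of the 32-bit value:

```python
# Grouping four 8-bit processing circuits so the group holds one 32-bit
# value: lane i contributes bits [8*i, 8*i+8) of the wide operand.

def split_32bit(value):
    """Split a 32-bit value into four 8-bit lanes (lane 0 = low byte)."""
    return [(value >> (8 * i)) & 0xFF for i in range(4)]

def join_lanes(lanes):
    """Recombine four 8-bit lanes into the original 32-bit value."""
    return sum(b << (8 * i) for i, b in enumerate(lanes))

assert join_lanes(split_32bit(0xDEADBEEF)) == 0xDEADBEEF
```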
As can be seen from fig. 7b, the layout of the processing circuits shown is similar to that of fig. 7a, but the interval numbers of the connections between the processing circuits differ. Fig. 7b shows four processing circuits numbered sequentially 0, 1, 2 and 3, connected in series starting from processing circuit 0 in the clockwise direction through processing circuit 1, processing circuit 3 and processing circuit 2, with processing circuit 2 connected back to processing circuit 0 to form a closed loop. In this loop, the interval number between connected processing circuits is 0 or 1; e.g., the interval number between processing circuits 0 and 1 is 0, while that between processing circuits 1 and 3 is 1. Further, the physical addresses of the four processing circuits in the closed loop shown may be 0-1-2-3, while in the looped arrangement shown their logical addresses become 0-1-3-2. Thus, when data of high bit width needs to be split and allocated to different processing circuits, the data order can be rearranged and allocated according to the logical addresses of the processing circuits.
The splitting and rearranging operations described above may be performed by the pre-operation circuit described in connection with fig. 3. In particular, the pre-operation circuit may rearrange the input data according to the physical and logical addresses of the plurality of processing circuits to satisfy the requirements of the data operation. Assuming that four sequentially arranged processing circuits 0 to 3 are connected as shown in fig. 7a, since the physical addresses and the logical addresses of the connection are both 0-1-2-3, the pre-operation circuit may transfer the input data (e.g., pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. However, when the four processing circuits are connected as shown in fig. 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2; in this case the pre-operation circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transmission to the corresponding processing circuits. By rearranging the input data in this way, the disclosed scheme can ensure the correctness of the data operation order. Similarly, if the sequence of the four operation output results (e.g., pixel data) obtained as above is bb0-bb1-bb3-bb2, the order of the operation output results can be restored to bb0-bb1-bb2-bb3 by the post-operation circuit described in connection with fig. 2, thereby ensuring consistency of arrangement between the input data and the output result data.
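The reordering can be illustrated as follows, where logical_of_physical records the logical address of the circuit at each physical position; this is a behavioral sketch of the pre-operation circuit, not its implementation.

```python
# Input item i is destined for the circuit whose logical address is i;
# the output lists items in physical-position order, reproducing the
# aa0-aa1-aa3-aa2 arrangement of fig. 7b.

def arrange(inputs, logical_of_physical):
    """logical_of_physical[p] = logical address of the circuit at
    physical position p."""
    return [inputs[logical_of_physical[p]] for p in range(len(inputs))]

data = ["aa0", "aa1", "aa2", "aa3"]
print(arrange(data, [0, 1, 2, 3]))   # fig. 7a: aa0-aa1-aa2-aa3
print(arrange(data, [0, 1, 3, 2]))   # fig. 7b: aa0-aa1-aa3-aa2
```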
Fig. 7c and 7d show further arrangements in which processing circuits are connected in different ways to form closed loops. As shown in fig. 7c, the sixteen processing circuits 104 numbered sequentially 0, 1, ..., 15 are connected in sequence and combined two by two, starting from processing circuit 0, to form processing circuit groups. For example, as shown in the figure, processing circuit 0 is connected to processing circuit 1 to form one processing circuit group, and so on, until processing circuits 14 and 15 are connected to form one group, finally yielding eight processing circuit groups. Further, the eight processing circuit groups may also be connected in a manner similar to the connection of individual processing circuits described above, including being connected according to, for example, predetermined logical addresses to form a closed loop of processing circuit groups.
As shown in fig. 7d, a plurality of processing circuits 104 are connected in an irregular or non-uniform manner to form a processing circuit matrix with a closed loop. In particular, fig. 7d shows that the processing circuits may be connected with interval numbers of 0 or 3 to form a closed loop; for example, processing circuit 0 may be connected to processing circuit 1 (interval number 0) and to processing circuit 4 (interval number 3), respectively.
As will be appreciated from the above description in connection with figs. 7a to 7d, the processing circuits of the present disclosure may be connected into a closed loop across different numbers of intervening processing circuits. When the total number of processing circuits changes, any interval number may be selected and dynamically configured to connect the circuits into a closed loop. It is also possible to combine a plurality of processing circuits into processing circuit groups and connect these into a closed loop of processing circuit groups. In addition, the connection of the plurality of processing circuits may be a hard connection configured by hardware or a soft connection configured by software.
Figs. 8a, 8b and 8c are schematic diagrams illustrating additional loop structures of processing circuits according to embodiments of the present disclosure. A plurality of processing circuits as shown in connection with fig. 6 may form a closed loop, and each processing circuit in the closed loop may be configured with a respective logical address. Further, the pre-operation circuit described in connection with fig. 2 may be configured to split the operation data according to its data type (e.g., 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the sub-data obtained after splitting to the corresponding processing circuits in the loop for subsequent operation.
The upper diagram of fig. 8a shows four processing circuits connected to form a closed loop, whose physical addresses in right-to-left order may be denoted 0-1-2-3. The lower diagram of fig. 8a shows that the logical addresses of the four processing circuits in this loop, in right-to-left order, are 0-3-1-2. For example, the processing circuit with logical address "3" in the lower diagram of fig. 8a has physical address "1" in the upper diagram.
In some application scenarios, it is assumed that the granularity of the operation data is the lower 128 bits of the input data, such as the original sequence "15, 14, ..., 2, 1, 0" in the figure (each number corresponding to one 8-bit data item), and the logical addresses of the sixteen 8-bit data items are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of fig. 8a, the pre-operation circuit may encode or arrange data of different logical addresses according to the different data types.
When the processing circuits operate on data with a bit width of 32 bits, the four numbers with logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) respectively represent the 0th to 3rd 32-bit data. The pre-operation circuit may transfer the 0th 32-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st 32-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd 32-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd 32-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). This rearrangement of the data satisfies the subsequent operation requirements of the processing circuits. The mapping relationship between the logical and physical arrangement of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
When the processing circuits operate on data with a bit width of 16 bits, the eight numbers with logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) respectively represent the 0th to 7th 16-bit data. The pre-operation circuit may transfer the 0th and 4th 16-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st and 5th 16-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd and 6th 16-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd and 7th 16-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping relationship between the logical and physical arrangement of the final data is therefore:
(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
When the processing circuits operate on data with a bit width of 8 bits, the sixteen numbers with logical addresses 0 to 15 respectively represent the 0th to 15th 8-bit data. According to the connection shown in fig. 8a, the pre-operation circuit may transmit the 0th, 4th, 8th and 12th 8-bit data to the processing circuit with logical address "0" (corresponding physical address "0"); the 1st, 5th, 9th and 13th 8-bit data to the processing circuit with logical address "1" (corresponding physical address "2"); the 2nd, 6th, 10th and 14th 8-bit data to the processing circuit with logical address "2" (corresponding physical address "3"); and the 3rd, 7th, 11th and 15th 8-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping relationship between the logical and physical arrangement of the final data is therefore:
(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
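The three mappings above can be reproduced with the following behavioral sketch, which deals the width-sized words round-robin to the circuits by logical address and reads the circuits back in physical order from high to low; the deal-out rule is inferred from the transfers described above, not taken from the patent's hardware.

```python
# Fig. 8a loop: physical position p holds the circuit with logical
# address LOGICAL_OF_PHYSICAL[p]; byte i carries logical number i.

LOGICAL_OF_PHYSICAL = [0, 3, 1, 2]

def map_bytes(width_bytes):
    data = list(range(16))
    words = [data[i:i + width_bytes]        # split into width-sized words
             for i in range(0, 16, width_bytes)]
    per_circuit = [[] for _ in range(4)]    # word w -> circuit w % 4 (logical)
    for w, word in enumerate(words):
        per_circuit[w % 4].extend(word)
    out = []                                # read back: physical high to low,
    for p in reversed(range(4)):            # bytes within a circuit high to low
        out.extend(reversed(per_circuit[LOGICAL_OF_PHYSICAL[p]]))
    return tuple(out)

assert map_bytes(4) == (11,10,9,8, 7,6,5,4, 15,14,13,12, 3,2,1,0)
assert map_bytes(2) == (13,12,5,4, 11,10,3,2, 15,14,7,6, 9,8,1,0)
assert map_bytes(1) == (14,10,6,2, 13,9,5,1, 15,11,7,3, 12,8,4,0)
```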
The upper diagram of fig. 8b shows eight processing circuits, numbered sequentially 0 to 7, connected to form a closed loop, the physical addresses of the eight circuits being 0-1-2-3-4-5-6-7. The logical addresses of these eight processing circuits are shown in the lower diagram of fig. 8b as 0-7-1-6-2-5-3-4. For example, the processing circuit with physical address "6" in the upper diagram corresponds to logical address "3" in the lower diagram.
The operations shown in fig. 8b for rearranging data of different data types and transmitting them to the corresponding processing circuits are similar to those shown in fig. 8a, so the technical solution described in connection with fig. 8a also applies to fig. 8b, and the data rearrangement process is not repeated here. Further, the connection relationship of the processing circuits shown in fig. 8b is similar to that of fig. 8a, but fig. 8b contains eight processing circuits, twice as many as fig. 8a. Thus, in application scenarios operating on different data types, the granularity of the operation data described in connection with fig. 8b may be twice that described in connection with fig. 8a. The granularity of the operation data in this example may therefore be the lower 256 bits of the input data, as opposed to the lower 128 bits in the previous example, such as the original data sequence "31, 30, ..., 2, 1, 0" shown in the figure, each number corresponding to an 8-bit data item.
For this original data sequence, the figure also shows the arrangement of the data in the looped processing circuits when the bit widths of the data operated on are 32 bits, 16 bits and 8 bits, respectively. For example, when the bit width of the data to be operated on is 32 bits, the one 32-bit data item in the processing circuit with logical address "1" is (7,6,5,4), the corresponding physical address of that circuit being "2". When the bit width is 16 bits, the two 16-bit data items in the processing circuit with logical address "3" are (23,22,7,6), the corresponding physical address being "6". When the bit width is 8 bits, the four 8-bit data items in the processing circuit with logical address "6" are (30,22,14,6), the corresponding physical address being "3".
The above description addressed data operations on different data types for the case where a plurality of processing circuits of a single type (e.g., the first type of processing circuit shown in fig. 3), as shown in figs. 8a and 8b, are connected to form a closed loop. Data operations on different data types for the case where a plurality of processing circuits of different types (such as the first and second types of processing circuits shown in fig. 4), as shown in fig. 8c, are connected to form a closed loop are described below.
The upper diagram of fig. 8c shows twenty processing circuits of multiple types, numbered sequentially 0, 1, ..., 19, connected to form a closed loop (the numbers being the physical addresses of the processing circuits shown in the diagram). The sixteen processing circuits numbered 0 to 15 are of the first type, and the four processing circuits numbered 16 to 19 are of the second type. Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower diagram of fig. 8c.
Further, when operating on different data types, for example the original sequence of eighty 8-bit data items shown in the figure, fig. 8c also shows the arrangement of this original data for the different data types supported by the processing circuits. For example, when the bit width of the data to be operated on is 32 bits, the one 32-bit data item in the processing circuit with logical address "1" is (7,6,5,4), the corresponding physical address being "2". When the bit width is 16 bits, the two 16-bit data items in the processing circuit with logical address "11" are (63,62,23,22), the corresponding physical address being "9". When the bit width is 8 bits, the four 8-bit data items in the processing circuit with logical address "17" are (77,57,37,17), the corresponding physical address being "18".
Figs. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by the pre-processing circuit according to embodiments of the present disclosure. As previously mentioned, the pre-processing circuit described in connection with fig. 2 of the present disclosure may be further configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, in order to perform a splicing operation on two input data. Regarding the plurality of data splicing modes, in one embodiment the disclosed scheme forms different modes by dividing the two data to be spliced into minimum data units and numbering them, and then extracting different minimum data units of the data based on a specified rule. For example, extraction and arrangement may be performed alternately based on the parity of the numbers, or on whether the numbers are integer multiples of a specified value, thereby forming different data splicing patterns. Depending on the calculation scenario (e.g., the data bit width), the minimum data unit here may be 1-bit data, or data of 2, 4, 8, 16 or 32 bits in length. Further, when extracting differently numbered portions of the two data, the disclosed scheme may extract alternately in units of one minimum data unit, or in multiples of the minimum data unit, for example alternately extracting partial data of two or three minimum data units at a time from the two data, as groups to be spliced group by group.
Based on the above description of the data splicing patterns, these patterns will be explained with specific examples in connection with figs. 9a to 9c. In the illustrated diagrams, the input data are In1 and In2; with each square in the diagrams representing one minimum data unit, both input data have a bit width of eight minimum data units. As previously described, the minimum data unit may represent a different number of bits for data of different bit widths. For example, for data with a bit width of 8 bits, the minimum data unit represents 1-bit data, while for data with a bit width of 16 bits, the minimum data unit represents 2-bit data. For another example, for data with a bit width of 32 bits, the minimum data unit represents 4-bit data.
As shown in fig. 9a, the two input data to be spliced, In1 and In2, each consist of eight minimum data units numbered 1, 2, ..., 8 sequentially from right to left. Data splicing is performed according to an odd-even interleaving principle: numbers are taken from small to large, In1 precedes In2, and odd numbers precede even numbers. Specifically, when the data bit width of the operation is 8 bits, In1 and In2 each represent one 8-bit data item and each minimum data unit represents 1-bit data (i.e., one square represents 1 bit). According to the data bit width and the splicing principle, the minimum data units of In1 numbered 1, 3, 5 and 7 are first extracted and arranged at the low-order end. Next, the four odd-numbered minimum data units of In2 are arranged in sequence. Similarly, the minimum data units of In1 numbered 2, 4, 6 and 8 and then the four even-numbered minimum data units of In2 are arranged in sequence. Finally, one 16-bit (or two 8-bit) new data item is formed from the sixteen minimum data units, as shown in the second row of squares in fig. 9a.
As shown in fig. 9b, when the data bit width is 16 bits, In1 and In2 each represent one 16-bit data item and each minimum data unit represents 2-bit data (i.e., one square represents 2 bits). According to the data bit width and the aforementioned interleaving principle, the minimum data units of In1 numbered 1, 2, 5 and 6 may be extracted first and arranged at the low-order end, followed by the minimum data units of In2 numbered 1, 2, 5 and 6. Similarly, the minimum data units of In1 and then of In2 numbered 3, 4, 7 and 8 are arranged in sequence, splicing one 32-bit (or two 16-bit) new data item from the final sixteen minimum data units, as shown in the second row of squares in fig. 9b.
As shown in fig. 9c, when the data bit width is 32 bits, In1 and In2 each represent one 32-bit data item and each minimum data unit represents 4-bit data (i.e., one square represents 4 bits). According to the data bit width and the aforementioned interleaving principle, the minimum data units of In1 numbered 1, 2, 3 and 4, followed by the identically numbered units of In2, may be extracted first and arranged at the low-order end. Then the minimum data units of In1 numbered 5, 6, 7 and 8, followed by the identically numbered units of In2, are arranged in sequence, so that one 64-bit (or two 32-bit) new data item is spliced from the final sixteen minimum data units.
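A behavioral sketch of these three interleaved splicing patterns is given below; the per-width extraction orders are taken directly from the descriptions of figs. 9a to 9c, while the function signature is an illustrative assumption.

```python
# Units are taken numbers-ascending, In1 before In2, first group before
# second group, matching figs. 9a-9c.

def splice(in1, in2, width_bits):
    """in1/in2: lists of 8 minimum data units (index k holds unit k+1)."""
    if width_bits == 8:        # fig. 9a: odd-numbered units first
        first = [1, 3, 5, 7]
    elif width_bits == 16:     # fig. 9b: units 1, 2, 5, 6 first
        first = [1, 2, 5, 6]
    else:                      # fig. 9c (32-bit): units 1, 2, 3, 4 first
        first = [1, 2, 3, 4]
    second = [n for n in range(1, 9) if n not in first]

    def pick(nums):            # In1's units, then In2's, same numbers
        return [in1[n - 1] for n in nums] + [in2[n - 1] for n in nums]

    return pick(first) + pick(second)

in1 = [f"a{k}" for k in range(1, 9)]
in2 = [f"b{k}" for k in range(1, 9)]
# Low-order end first: a1 a3 a5 a7 b1 b3 b5 b7 a2 a4 a6 a8 b2 b4 b6 b8
print(splice(in1, in2, 8))
```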
Exemplary data splicing modes of the present disclosure are described above in connection with figs. 9a to 9c. It will be appreciated, however, that in some computing scenarios data splicing does not involve the interleaved arrangement described above, but simply arranges the two data items while maintaining their respective original data positions, as shown in fig. 9d. As can be seen from fig. 9d, the two data items In1 and In2 are not interleaved as in figs. 9a to 9c; instead, only the last minimum data unit of In1 and the first minimum data unit of In2 are connected in series, thereby obtaining a new data type of increased (e.g., doubled) bit width. In some scenarios, the disclosed scheme may also perform grouped splicing based on data attributes. For example, neuron data or weight data belonging to the same feature map may be grouped and arranged to form a continuous portion of the spliced data.
Figs. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by the post-processing circuit according to embodiments of the present disclosure. The compression operation may include screening the data with a mask, or compressing by comparing the data against a given threshold. For the data compression operations, the data may likewise be divided into minimum data units and numbered as previously described. Similar to what was described in connection with figs. 9a to 9d, the minimum data unit may be, for example, 1-bit data, or data of 2, 4, 8, 16 or 32 bits in length. Different data compression modes are described exemplarily below in connection with figs. 10a to 10c.
As shown in fig. 10a, the original data consists of eight squares (i.e., eight minimum data units) numbered sequentially 1, 2, ..., 8 from right to left, it being assumed that each minimum data unit represents 1-bit data. When performing a data compression operation according to a mask, the post-processing circuit may screen the original data with the mask. In one embodiment, the bit width of the mask corresponds to the number of minimum data units of the original data. For example, if the original data has eight minimum data units, the mask is 8 bits wide, with the minimum data unit numbered 1 corresponding to the least significant bit of the mask and the unit numbered 2 to the second least significant bit; by analogy, the unit numbered 8 corresponds to the most significant bit. In one application scenario, when the 8-bit mask is "10010011", the compression principle may be set to extract the minimum data units of the original data corresponding to the mask bits that are "1". Here, the numbers of the minimum data units corresponding to a mask value of "1" are 1, 2, 5 and 8. Thus, the minimum data units numbered 1, 2, 5 and 8 may be extracted and arranged in order of number from low to high to form the compressed new data, as shown in the second row of fig. 10a.
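Both screening modes can be sketched as follows (the threshold mode corresponds to fig. 10c, described below); this is an illustrative model only, with the mask's least significant bit gating the unit numbered 1 as stated above.

```python
# Kept units stay in ascending number order in both modes.

def compress_by_mask(units, mask):
    """Keep unit i+1 only where mask bit i is 1."""
    return [u for i, u in enumerate(units) if (mask >> i) & 1]

def compress_by_threshold(units, threshold):
    """Keep only units whose value is >= threshold."""
    return [u for u in units if u >= threshold]

# fig. 10a example: unit k holds the value k here, mask = "10010011"
units = [1, 2, 3, 4, 5, 6, 7, 8]
assert compress_by_mask(units, 0b10010011) == [1, 2, 5, 8]
```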
Fig. 10b shows original data similar to fig. 10a, and as can be seen from the second row of fig. 10b, the data passing through the post-processing circuit maintains the original arrangement order and content. It will thus be appreciated that the data compression of the present disclosure may also include a disabled (or non-compressed) mode, in which no compression operation is performed as the data passes through the post-processing circuit.
As shown in fig. 10c, the original data consists of eight squares arranged in sequence and numbered 1, 2, ..., 8 from right to left (the number above each square indicating its number), it being assumed here that each minimum data unit is 8-bit data. The number inside each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", corresponding to the 8-bit data "00001000". When performing a data compression operation according to a threshold, assuming the threshold is the decimal value "8", the compression rule may be set to extract all minimum data units of the original data that are greater than or equal to the threshold "8". Thus, the minimum data units numbered 1, 4, 7 and 8 are extracted. All extracted minimum data units are then arranged in order of number from low to high to obtain the final data result, as shown in the second row of fig. 10c.

Fig. 11 is a simplified flow diagram illustrating a method 1100 of performing an arithmetic operation using a computing device according to an embodiment of the present disclosure. From the foregoing, it will be appreciated that the computing device here may be the computing device described in connection with figs. 1 (including figs. 1a and 1b) to 4, having the processing circuit connections shown in figs. 5 to 10 and supporting the classes of operations described above.
As shown in fig. 11, at step 1110 the method 1100 utilizes the control circuit to fetch and parse a Very Long Instruction Word (VLIW) instruction, obtaining a parsed VLIW instruction, and to send the parsed VLIW instruction to the plurality of processing circuits. In one embodiment, the operands of the VLIW instruction include descriptors indicating the shape of a tensor, and during parsing at step 1110 the method 1100 further determines, with the control circuit, the storage addresses of the data corresponding to the operands from the descriptors. Next, at step 1120, the method 1100 connects the plurality of processing circuits into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and performs multi-threaded operations using the one or more processing circuit arrays based on the parsed VLIW instruction and the aforementioned storage addresses.
In one embodiment, the method 1100 configures the processing circuit arrays, according to the configuration instructions, to form a closed loop in at least one of the one-dimensional or multi-dimensional directions. In another embodiment, the VLIW instructions comprise one or more operational instructions, and the method 1100 configures the one or more processing circuit arrays to perform multi-threaded operations according to those operational instructions. In one application scenario, the VLIW instructions, configuration instructions and data read-write instructions include respective corresponding predicates, and the method includes configuring the control circuit, processing circuits and storage circuit to determine, according to the corresponding predicates, whether to execute the VLIW instructions, configuration instructions and/or data read-write instructions. A possible high-level rendering of this flow is sketched below.
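Purely as an illustration, the following sketch renders the flow of method 1100 under assumed object interfaces; every name (control, storage, array and their methods) is hypothetical, not the patent's API.

```python
# High-level sketch of method 1100: the control circuit parses the VLIW
# instruction, resolves operand addresses from the tensor-shape
# descriptors, and the configured processing circuit array runs the
# multi-threaded operation. All interfaces are assumptions.

def run_vliw(control, storage, array, raw_instruction):
    instr = control.parse(raw_instruction)                   # step 1110: fetch & parse
    addrs = [control.resolve(d) for d in instr.descriptors]  # descriptor -> address
    array.configure(instr.topology)                          # 1D/multi-D array, loops
    if instr.predicate_ok():                                 # predicate gating
        array.execute(instr, [storage.read(a) for a in addrs])   # step 1120
```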
For the sake of brevity, only the method of the present disclosure and some embodiments thereof have been described above in connection with fig. 11. Those skilled in the art can also appreciate that the method may include more steps according to the disclosure of the present disclosure, and the execution of the steps may implement various operations of the present disclosure described above in conjunction with fig. 1 to 10, which are not described herein again.
Fig. 12 is a block diagram illustrating a combined processing device 1200 according to an embodiment of the present disclosure. As shown in fig. 12, the combined processing device 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 may be included in the computing processing device and may be configured to perform the operations described herein in conjunction with fig. 1-11.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs) and artificial intelligence processors. These processors may include, but are not limited to, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, and discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure alone may be considered to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing device can serve as an interface between the computing processing device of the present disclosure (which can be embodied as a computing device associated with artificial intelligence operations such as neural network operations) and external data and control, performing basic control including, but not limited to, data handling and the starting and/or stopping of the computing device. In further embodiments, the other processing device may also cooperate with the computing processing device to collectively complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write it into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or additionally, the interface device may also read data from a storage device of the computing processing device and transmit it to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained in the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 1302 shown in fig. 13). In one implementation, the chip is a System on Chip (SoC) integrating one or more combined processing devices as shown in fig. 12. The chip may be connected to other associated components through an external interface device, such as the external interface device 1306 shown in fig. 13. The associated component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the aforementioned chip, and a board card including the chip package structure. The board card is described in detail below with reference to fig. 13.
Fig. 13 is a schematic diagram illustrating the structure of a board card 1300 according to an embodiment of the present disclosure. As shown in fig. 13, the board card includes a memory device 1304 for storing data, comprising one or more memory cells 1310. The memory device may be connected to, and exchange data with, the control device 1308 and the chip 1302 described above by means of, for example, a bus. Further, the board card includes an external interface device 1306 configured for data relay or transfer between the chip (or a chip in the chip package structure) and an external device 1312 (such as a server or a computer). For example, the data to be processed may be transferred to the chip by the external device through the external interface device, and the calculation results of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different interface forms, for example a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a microcontroller (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 12 and 13, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations of actions, but those skilled in the art will appreciate that aspects of the present disclosure are not limited by the order of the actions described. Accordingly, one of ordinary skill in the art will appreciate that, in light of the disclosure or teachings herein, certain steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the actions or modules involved are not necessarily required to practice one or more aspects of the disclosure. In addition, depending on the solution, the description of some embodiments may be more detailed than that of others. In view of the above, those skilled in the art will understand that, for portions of the present disclosure not described in detail in one embodiment, reference may be made to the descriptions of other embodiments.
In specific implementations, based on the disclosure and teachings herein, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may also be implemented in ways not described here. For example, the division of units in the foregoing embodiments of the electronic device or apparatus is based on logical functions, and other manners of division are possible in actual implementation. Also for example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for connections between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between those units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions of the embodiments of the present disclosure. Moreover, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The memory may include, but is not limited to, a USB flash drive, a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In other implementation scenarios, the integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1, a computing device comprising control circuitry and a plurality of processing circuits, wherein:
the control circuitry is configured to fetch and parse a Very Long Instruction Word (VLIW) instruction, and to send the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction includes a descriptor indicating the shape of a tensor, and the control circuitry is configured to determine, during the parsing, a storage address of the data corresponding to the operand from the descriptor; and
the plurality of processing circuits are connected into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and the one or more processing circuit arrays are configured to perform multi-threaded operations based on the parsed VLIW instructions and the storage addresses.
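As an illustration of the division of labour in clause 1, the following minimal Python sketch (Python is used for all code examples in this section) models a control circuit that parses a VLIW bundle into per-slot sub-instructions and broadcasts the parsed bundle to an array of processing circuits. The slot layout, the class names, and the execute interface are hypothetical, not the patented encoding.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SubInstruction:
        opcode: str    # e.g. "mul" or "add" (hypothetical opcodes)
        operands: Tuple  # register ids or a descriptor id

    @dataclass
    class VLIWBundle:
        slots: List[SubInstruction]  # one sub-instruction per functional slot

    class ProcessingCircuit:
        def execute(self, bundle: VLIWBundle):
            # Stand-in for one hardware thread of the multi-threaded operation.
            return [s.opcode for s in bundle.slots]

    class ControlCircuit:
        def __init__(self, circuits: List[ProcessingCircuit]):
            self.circuits = circuits

        def parse(self, raw: List[Tuple]) -> VLIWBundle:
            # In hardware, parsing would also resolve descriptor operands
            # into storage addresses (see the descriptor sketch below).
            return VLIWBundle([SubInstruction(op, args) for op, args in raw])

        def dispatch(self, bundle: VLIWBundle):
            # Broadcast the parsed bundle to every processing circuit.
            return [c.execute(bundle) for c in self.circuits]

    cc = ControlCircuit([ProcessingCircuit() for _ in range(4)])
    print(cc.dispatch(cc.parse([("mul", ("r0", "r1")), ("add", ("r2", "r3"))])))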
Clause 2, the computing device of clause 1, wherein the VLIW instruction comprises an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
Clause 3, the computing device of clause 2, wherein the contents of the descriptor further include at least one address parameter representing an address of tensor data.
Clause 4, the computing device of clause 3, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
Clause 5, the computing device of clause 4, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
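Clauses 2 through 5 describe descriptors carrying shape and address parameters. The sketch below shows one way such parameters could map an N-dimensional index to a linear storage address; the row-major layout, the 4-byte element size, and the field names are assumptions of this illustration, not the patented descriptor format.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Descriptor:
        base_address: int               # reference address of the data reference point
        space_dims: Tuple[int, ...]     # size of the data storage space per dimension
        region_offset: Tuple[int, ...]  # offset of the storage region per dimension
        element_size: int = 4           # bytes per element (assumption)

        def address_of(self, index: Tuple[int, ...]) -> int:
            # Row-major strides over the full storage space.
            strides, acc = [], 1
            for dim in reversed(self.space_dims):
                strides.append(acc)
                acc *= dim
            strides.reverse()
            offs = sum((i + o) * s for i, o, s in
                       zip(index, self.region_offset, strides))
            return self.base_address + offs * self.element_size

    # Example: a storage region offset by (1, 0) inside a 4x8 storage space.
    d = Descriptor(base_address=0x1000, space_dims=(4, 8), region_offset=(1, 0))
    assert d.address_of((0, 2)) == 0x1000 + (1 * 8 + 2) * 4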
Clause 6, the computing device of clause 1, wherein the control circuitry is further configured to obtain configuration instructions, the plurality of processing circuits being configured to connect according to the configuration instructions so as to form the one or more processing circuit arrays.
Clause 7, the computing device of clause 6, wherein the array of processing circuits is configured to form a closed loop in at least one of the one-dimensional or multi-dimensional directions in accordance with the configuration instructions.
Clause 8, the computing device according to clause 6 or 7, wherein the control circuit comprises one or more registers storing configuration information about the processing circuit array, the control circuit being configured to read the configuration information from the registers and send it to the processing circuits according to the configuration instruction, so that the processing circuits are connected according to the configuration information, the configuration information comprising preset position information of the processing circuits constituting the one or more processing circuit arrays, and further comprising, when a processing circuit array is configured to form a closed loop, looping configuration information about the processing circuit array forming the closed loop.
Clause 9, the computing device of clause 7, wherein processing circuits located in a two-dimensional array are configured to connect, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, with a remaining one or more of the processing circuits in the same row, column, or diagonal, so as to form one or more closed loops.
Clause 10, the computing device of clause 9, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
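The two-dimensional spacing pattern of clauses 9 and 10 can be pictured with a small helper: chaining every k-th processing circuit along one dimension, with wrap-around, yields one or more closed loops. This is only a sketch of the interval idea, not the patented wiring.

    def ring_with_spacing(length: int, k: int):
        """Return (src, dst) pairs chaining circuits at interval k into loops;
        k = 1 gives a plain nearest-neighbour ring."""
        return [(i, (i + k) % length) for i in range(length)]

    # 8 circuits with spacing 2 form two interleaved closed loops:
    # 0-2-4-6-0 and 1-3-5-7-1.
    print(ring_with_spacing(8, 2))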
Clause 11, the computing device of clause 7, wherein the array of processing circuits is connected in a loop of a three-dimensional array made up of a plurality of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein:
the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops.
Clause 12, the computing device of clause 11, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of spaced layers between processing circuits to be connected.
Clause 13, the computing device of clause 6, wherein the control circuitry is configured to send at least one of a constant term and an entry to the array of processing circuitry in accordance with the configuration instruction in order to perform the multi-threaded operation.
Clause 14, the computing device of clause 1, further comprising storage circuitry, wherein the control circuitry is further configured to obtain data read and write instructions and to send the data read and write instructions to the storage circuitry, the storage circuitry configured to perform read and write operations of data related to the multi-threaded operations according to the data read and write instructions.
Clause 15, the computing device of clause 14, wherein the data read-write instruction includes at least address information and data amount information of the data.
Clause 16, the computing device of clause 1, wherein the VLIW instruction comprises one or more arithmetic instructions, and the one or more processing circuit arrays are configured to perform multithreaded arithmetic operations in accordance with the arithmetic instructions.
Clause 17, the computing device of clause 16, wherein the plurality of processing circuit arrays are configured to each execute a different operational instruction, or at least two of the plurality of processing circuit arrays are configured to execute the same operational instruction.
Clause 18, the computing device of clause 1, further comprising data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the VLIW instructions further comprise pre-manipulation instructions and/or post-manipulation instructions, wherein the pre-manipulation circuitry is configured to perform pre-manipulation operations on input data of the multi-threaded operations according to the pre-manipulation instructions, and the post-manipulation circuitry is configured to perform post-manipulation operations on output data of the multi-threaded operations according to the post-manipulation instructions.
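Clause 18's pre- and post-manipulation stages can be read as a pipeline wrapped around the array operation. In the sketch below, the concrete operations (an integer cast and a clamp) are invented purely for illustration.

    def run_pipeline(data, pre_op, array_op, post_op):
        """Apply pre-manipulation, the multi-threaded operation, then
        post-manipulation, mirroring the stage order of clause 18."""
        return [post_op(y) for y in array_op([pre_op(x) for x in data])]

    out = run_pipeline(
        [-2.5, 0.5, 3.9],
        pre_op=int,                               # hypothetical pre-manipulation: cast
        array_op=lambda xs: [x * 2 for x in xs],  # stand-in array operation
        post_op=lambda y: max(y, 0),              # hypothetical post-manipulation: clamp
    )
    print(out)   # -> [0, 0, 6]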
Clause 19, the computing device according to any one of clauses 1-18, wherein the VLIW instruction further comprises a move instruction, and the array of processing circuits is configured to perform a move operation on data between processing circuits according to the move instruction.
Clause 20, the computing device of clause 19, wherein the move instruction further comprises a mask instruction, the array of processing circuits configured to selectively move data according to the mask instruction.
Clause 21, the computing device of clause 19, wherein the move instruction further comprises register identification information for indicating a source register and a destination register to move data between processing circuits, the processing circuits configured to move data from the source register to the destination register in accordance with the register identification information.
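Clauses 19 through 21 describe move instructions that name a source register, a target register, and optionally a mask. The sketch below models each processing circuit's register file as a dictionary and moves data around a nearest-neighbour ring; the register names, the ring topology, and the mask representation are assumptions of this illustration.

    def masked_ring_move(circuits, src_reg, dst_reg, mask=None):
        """Shift src_reg of each circuit into dst_reg of its ring neighbour,
        skipping destinations whose mask bit is cleared."""
        n = len(circuits)
        moved = [circuits[i][src_reg] for i in range(n)]  # read before write
        for i in range(n):
            j = (i + 1) % n                  # nearest-neighbour ring
            if mask is None or mask[j]:
                circuits[j][dst_reg] = moved[i]

    circuits = [{"r0": v, "r1": 0} for v in (10, 20, 30, 40)]
    masked_ring_move(circuits, "r0", "r1", mask=[True, False, True, True])
    print([c["r1"] for c in circuits])   # -> [40, 0, 20, 30]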
Clause 22, the computing device of clause 19, wherein the VLIW instruction, the configuration instruction, and the data read-write instruction include respective corresponding predicates, and the control circuitry, the processing circuitry, and the storage circuitry are configured to determine whether to execute the VLIW instruction, the configuration instruction, and/or the data read-write instruction based on the corresponding predicates.
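Clause 22 attaches predicates to instructions so that the receiving circuits can decide whether to execute them. A minimal sketch of that gating, with a hypothetical predicate-register file:

    def execute_predicated(instructions, predicate_regs):
        """Run each (predicate_id, action) pair only when its predicate holds."""
        results = []
        for pred_id, action in instructions:
            if predicate_regs.get(pred_id, False):
                results.append(action())
        return results

    preds = {"p0": True, "p1": False}
    print(execute_predicated(
        [("p0", lambda: "VLIW slot executed"),
         ("p1", lambda: "skipped")],
        preds,
    ))   # -> ['VLIW slot executed']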
Clause 23, the computing device of clause 19, wherein the VLIW instruction is combined with at least one of the configuration instruction and the data read-write instruction to form an extended VLIW instruction.
Clause 24, an integrated circuit chip comprising the computing device of any one of clauses 1-23.
Clause 25, a board comprising the integrated circuit chip of clause 24.
Clause 26, an electronic device, comprising the integrated circuit chip of clause 24.
Clause 27, a method of performing a computing operation using a computing device, wherein the computing device comprises a control circuit and a plurality of processing circuits, the method comprising:
fetching and parsing a Very Long Instruction Word (VLIW) instruction with the control circuit, and sending the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction comprises a descriptor indicating the shape of a tensor, and the parsing comprises determining, with the control circuit, a storage address of the data corresponding to the operand from the descriptor; and
connecting the plurality of processing circuits into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and performing multi-threaded operations using the one or more processing circuit arrays based on the parsed VLIW instructions and the storage addresses.
Clause 28, the method of clause 27, wherein the VLIW instruction comprises an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
Clause 29, the method of clause 28, wherein the contents of the descriptor further comprise at least one address parameter representing an address of tensor data.
Clause 30, the method of clause 29, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
Clause 31, the method of clause 30, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
Clause 32, the method of clause 27, wherein the control circuit is utilized to obtain configuration instructions and the plurality of processing circuits are configured to be connected according to the configuration instructions to form the one or more processing circuit arrays.
Clause 33, the method of clause 32, wherein the array of processing circuits is configured to form a closed loop in at least one of the one-dimensional or multi-dimensional directions in accordance with the configuration instructions.
Clause 34, the method of clause 32 or 33, wherein the control circuit comprises one or more registers storing configuration information about the processing circuit array, the method further comprising configuring the control circuit to read the configuration information from the registers and send it to the processing circuits according to the configuration instructions, so that the processing circuits are connected according to the configuration information, the configuration information comprising preset position information of the processing circuits constituting the one or more processing circuit arrays, and further comprising, when a processing circuit array is configured to form a closed loop, looping configuration information about the processing circuit array forming the closed loop.
Clause 35, the method of clause 33, wherein the processing circuits located in the two-dimensional array are configured to be connected in at least one of their row, column or diagonal directions with a predetermined two-dimensional spacing pattern with the remaining one or more of the processing circuits in the same row, column or diagonal so as to form one or more closed loops.
Clause 36, the method of clause 35, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
Clause 37, the method of clause 33, wherein the array of processing circuits is connected in a loop of a three-dimensional array made up of a plurality of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein:
the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops.
Clause 38, the method of clause 37, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of spacing layers between processing circuits to be connected.
Clause 39, the method of clause 32, wherein the control circuitry is configured to send at least one of a constant term and an entry to the array of processing circuitry in accordance with the configuration instruction in order to perform the multi-threaded operation.
Clause 40, the method of clause 27, wherein the computing device further comprises storage circuitry, the method further comprising configuring the control circuitry to fetch data read and write instructions and to send the data read and write instructions to the storage circuitry, and configuring the storage circuitry to perform read and write operations of data related to the multi-threaded operations in accordance with the data read and write instructions.
Clause 41, the method of clause 40, wherein the data read/write instruction includes at least address information and data amount information of the data.
Clause 42, the method of clause 27, wherein the VLIW instruction comprises one or more arithmetic instructions, and the one or more processing circuit arrays are configured to perform multithreaded arithmetic operations in accordance with the arithmetic instructions.
Clause 43, the method of clause 42, wherein the plurality of processing circuit arrays are configured to each execute a different operational instruction, or at least two of the plurality of processing circuit arrays are configured to execute the same operational instruction.
Clause 44, the method of clause 27, wherein the computing device further comprises data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the VLIW instructions further comprise pre-manipulation instructions and/or post-manipulation instructions, wherein the pre-manipulation circuitry is configured to perform pre-manipulation operations on input data of the multi-threaded operations according to the pre-manipulation instructions and the post-manipulation circuitry is configured to perform post-manipulation operations on output data of the multi-threaded operations according to the post-manipulation instructions.
Clause 45, the method according to any one of clauses 27-44, wherein the VLIW instruction further comprises a move instruction, and the method comprises configuring the array of processing circuits to perform a move operation on data between processing circuits according to the move instruction.
Clause 46, the method of clause 45, wherein the move instruction further comprises a mask instruction, the method comprising configuring the array of processing circuits to selectively move data according to the mask instruction.
Clause 47, the method of clause 45, wherein the move instruction further comprises register identification information for indicating a source register and a target register to move data between processing circuits, the method comprising configuring the processing circuits to move data from the source register to the target register in accordance with the register identification information.
Clause 48, the method of clause 45, wherein the VLIW instruction, the configuration instruction, and the data read-write instruction include respective corresponding predicates, and the method comprises configuring the control circuitry, the processing circuitry, and the storage circuitry to determine whether to execute the VLIW instruction, the configuration instruction, and/or the data read-write instruction based on the corresponding predicates.
Clause 49, the method of clause 45, wherein the VLIW instruction is combined with at least one of the configuration instruction and the data read-write instruction to form an extended VLIW instruction.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (49)

1. A computing device comprising a plurality of processing circuits and a control circuit, wherein:
the control circuitry is configured to fetch and parse a Very Long Instruction Word (VLIW) instruction, and to send the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction includes a descriptor indicating the shape of a tensor, and the control circuitry is configured to determine, during the parsing, a storage address of the data corresponding to the operand from the descriptor; and
the plurality of processing circuits are connected into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and the one or more processing circuit arrays are configured to perform multi-threaded operations based on the parsed VLIW instructions and the storage addresses.
2. The computing device of claim 1, wherein the VLIW instruction comprises an identification of a descriptor and/or contents of a descriptor comprising at least one shape parameter representing a shape of tensor data.
3. The computing device of claim 2, wherein contents of the descriptor further include at least one address parameter representing an address of tensor data.
4. The computing device of claim 3, wherein address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
5. The computing device of claim 4, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
6. The computing device of claim 1, wherein the control circuitry is further configured to obtain configuration instructions, the plurality of processing circuits configured to connect according to the configuration instructions to form the one or more processing circuit arrays.
7. The computing device of claim 6, wherein the array of processing circuits is configured to form a closed loop in at least one of one-dimensional or multi-dimensional directions in accordance with the configuration instructions.
8. The computing device of claim 6 or 7, wherein the control circuitry comprises one or more registers storing configuration information about the processing circuit array, the control circuitry being configured to read the configuration information from the registers and send it to the processing circuits in accordance with the configuration instructions, so that the processing circuits are connected according to the configuration information, the configuration information comprising preset position information of the processing circuits constituting the one or more processing circuit arrays, and further comprising, when a processing circuit array is configured to form a closed loop, looping configuration information about the processing circuit array forming the closed loop.
9. The computing device of claim 7, wherein the processing circuits located in the two-dimensional array are configured to connect with a remaining one or more of the processing circuits in a same row, column, or diagonal in at least one of their row, column, or diagonal directions in a predetermined two-dimensional spacing pattern so as to form one or more closed loops.
10. The computing device of claim 9, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
11. The computing device of claim 7, wherein the array of processing circuits is connected in a loop of a three-dimensional array of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein:
the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops.
12. The computing device of claim 11, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of layers of spaces between processing circuits to be connected.
13. The computing device of claim 6, wherein the control circuitry is configured to send at least one of a constant term and an entry to the array of processing circuitry in accordance with the configuration instruction in order to perform the multi-threaded operation.
14. The computing device of claim 1, further comprising storage circuitry, wherein the control circuitry is further configured to obtain data read and write instructions and to send the data read and write instructions to the storage circuitry, the storage circuitry configured to perform read and write operations of data related to the multi-threaded operations in accordance with the data read and write instructions.
15. The computing device of claim 14, wherein the data read and write instructions comprise at least address information and data volume information for data.
16. The computing device of claim 1, wherein the VLIW instructions comprise one or more arithmetic instructions, and the one or more processing circuit arrays are configured to perform multithreaded arithmetic operations in accordance with the arithmetic instructions.
17. The computing device of claim 16, wherein the plurality of processing circuit arrays are configured to each execute a different operational instruction, or at least two of the plurality of processing circuit arrays are configured to execute the same operational instruction.
18. The computing device of claim 1, further comprising data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the VLIW instructions further comprise pre-manipulation instructions and/or post-manipulation instructions, wherein the pre-manipulation circuitry is configured to pre-manipulate input data of the multi-threaded operations according to the pre-manipulation instructions and the post-manipulation circuitry is configured to post-manipulate output data of the multi-threaded operations according to the post-manipulation instructions.
19. The computing device of any of claims 1-18, wherein the VLIW instruction further comprises a move instruction, and the processing circuit array is configured to perform a move operation on data between processing circuits according to the move instruction.
20. The computing device of claim 19, wherein the move instruction further comprises a mask instruction, the processing circuit array configured to selectively move data according to the mask instruction.
21. The computing device of claim 19, wherein the move instruction further comprises register identification information for indicating a source register and a destination register to move data between processing circuits, the processing circuits being configured to move data from the source register to the destination register in accordance with the register identification information.
22. The computing device of claim 19, wherein the VLIW instructions, configuration instructions, and data read and write instructions comprise respective corresponding predicates, and the control circuitry, processing circuitry, and storage circuitry are configured to determine whether to execute VLIW instructions, configuration instructions, and/or data read and write instructions based on the corresponding predicates.
23. The computing device of claim 19, wherein the VLIW instruction is combined with at least one of the configuration instruction and a data read-write instruction to form an extended VLIW instruction.
24. An integrated circuit chip comprising the computing device of any of claims 1-23.
25. A board card comprising the integrated circuit chip of claim 24.
26. An electronic device comprising the integrated circuit chip of claim 24.
27. A method of performing a computing operation using a computing device, wherein the computing device includes a plurality of processing circuits and a control circuit, the method comprising:
fetching and parsing a Very Long Instruction Word (VLIW) instruction with the control circuit, and sending the parsed VLIW instruction to the plurality of processing circuits, wherein an operand of the VLIW instruction comprises a descriptor indicating the shape of a tensor, and the parsing comprises determining, with the control circuit, a storage address of the data corresponding to the operand from the descriptor; and
connecting the plurality of processing circuits into one or more processing circuit arrays in a one-dimensional or multi-dimensional array configuration, and performing multi-threaded operations using the one or more processing circuit arrays based on the parsed VLIW instructions and the storage addresses.
28. The method of claim 27, wherein the VLIW instruction comprises an identification of a descriptor and/or contents of a descriptor comprising at least one shape parameter representing a shape of tensor data.
29. The method of claim 28, wherein the contents of the descriptor further include at least one address parameter representing an address of tensor data.
30. The method of claim 29, wherein address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
31. The method of claim 30, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
32. The method of claim 27, wherein configuration instructions are fetched with the control circuitry and the plurality of processing circuits are configured to be connected according to the configuration instructions so as to form the one or more processing circuit arrays.
33. The method of claim 32, wherein the array of processing circuits is configured to form a closed loop in at least one of one-dimensional or multi-dimensional directions in accordance with the configuration instructions.
34. The method as claimed in claim 32 or 33, wherein said control circuitry comprises one or more registers storing configuration information about said processing circuit array, said method further comprising configuring said control circuitry to read said configuration information from said registers and send it to said processing circuits in accordance with said configuration instructions, so that said processing circuits are connected according to said configuration information, said configuration information comprising preset position information of the processing circuits constituting said one or more processing circuit arrays, and further comprising, when said processing circuit array is configured to form a closed loop, looping configuration information about said processing circuit array forming the closed loop.
35. The method of claim 33, wherein the processing circuits located in the two-dimensional array are configured to be connected in at least one of their row, column or diagonal directions with a predetermined two-dimensional spacing pattern with the remaining one or more of the processing circuits in the same row, column or diagonal so as to form one or more closed loops.
36. The method of claim 35, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
37. The method of claim 33, wherein the array of processing circuits is connected in a loop of a three-dimensional array of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein:
the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops.
38. The method of claim 37, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of layers of spaces between processing circuits to be connected.
39. The method of claim 32, wherein the control circuitry is configured to send at least one of a constant term and an entry to the array of processing circuitry in accordance with the configuration instruction in order to perform the multi-threaded operation.
40. The method of claim 27, wherein the computing device further comprises storage circuitry, the method further comprising configuring the control circuitry to fetch data read and write instructions and to send the data read and write instructions to the storage circuitry, and configuring the storage circuitry to perform read and write operations of data related to the multi-threaded operations in accordance with the data read and write instructions.
41. The method of claim 40, wherein the data read/write instruction comprises at least address information and data amount information of the data.
42. The method of claim 27, wherein the VLIW instructions comprise one or more arithmetic instructions and the one or more processing circuit arrays are configured to perform multithreaded arithmetic operations in accordance with the arithmetic instructions.
43. The method of claim 42, wherein the plurality of processing circuit arrays are configured to each execute a different operational instruction or at least two of the plurality of processing circuit arrays are configured to execute the same operational instruction.
44. The method of claim 27, wherein the computing device further comprises data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the VLIW instructions further comprise pre-manipulation instructions and/or post-manipulation instructions, wherein the pre-manipulation circuitry is configured to pre-manipulate input data of the multi-threaded operations according to the pre-manipulation instructions and the post-manipulation circuitry is configured to post-manipulate output data of the multi-threaded operations according to the post-manipulation instructions.
45. The method of any of claims 27-44, wherein the VLIW instruction further comprises a move instruction, and the method comprises configuring the array of processing circuits to perform move operations on data between processing circuits according to the move instruction.
46. The method of claim 45, wherein the move instruction further comprises a mask instruction, the method comprising configuring the processing circuit array to selectively move data according to the mask instruction.
47. The method according to claim 45, wherein the move instruction further comprises register identification information for indicating a source register and a target register for moving data between processing circuits, the method comprising configuring the processing circuits to move data from the source register to the target register in dependence on the register identification information.
48. The method of claim 45, wherein the VLIW instructions, configuration instructions, and data read and write instructions comprise respective corresponding predicates, and the method comprises configuring the control circuitry, processing circuitry, and storage circuitry to determine whether to execute the VLIW instructions, configuration instructions, and/or data read and write instructions in accordance with the corresponding predicates.
49. The method of claim 45, wherein the VLIW instruction is combined with at least one of the configuration instruction and a data read-write instruction to form an extended VLIW instruction.
CN202010618089.3A 2020-06-30 2020-06-30 Computing device, chip, board card, electronic equipment and computing method Pending CN113867788A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010618089.3A CN113867788A (en) 2020-06-30 2020-06-30 Computing device, chip, board card, electronic equipment and computing method
PCT/CN2021/095704 WO2022001499A1 (en) 2020-06-30 2021-05-25 Computing apparatus, chip, board card, electronic device and computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010618089.3A CN113867788A (en) 2020-06-30 2020-06-30 Computing device, chip, board card, electronic equipment and computing method

Publications (1)

Publication Number Publication Date
CN113867788A (en) 2021-12-31

Family

ID=78981802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010618089.3A Pending CN113867788A (en) 2020-06-30 2020-06-30 Computing device, chip, board card, electronic equipment and computing method

Country Status (2)

Country Link
CN (1) CN113867788A (en)
WO (1) WO2022001499A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411490B (en) * 2011-08-09 2014-04-16 清华大学 Instruction set optimization method for dynamically reconfigurable processors
KR20130131789A (en) * 2012-05-24 2013-12-04 삼성전자주식회사 Reconfigurable procesor based on mini-core and method for processing flexible multiple data using the reconfigurable processor
WO2018213598A1 (en) * 2017-05-17 2018-11-22 Google Llc Special purpose neural network training chip
US10867239B2 (en) * 2017-12-29 2020-12-15 Spero Devices, Inc. Digital architecture supporting analog co-processor

Also Published As

Publication number Publication date
WO2022001499A1 (en) 2022-01-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination