CN113867799A - Computing device, integrated circuit chip, board card, electronic equipment and computing method


Info

Publication number
CN113867799A
CN113867799A (application number CN202010619425.6A)
Authority
CN
China
Prior art keywords
data
arithmetic
stage
descriptor
pipeline
Prior art date
Legal status
Pending
Application number
CN202010619425.6A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202010619425.6A priority Critical patent/CN113867799A/en
Priority to PCT/CN2021/095701 priority patent/WO2022001497A1/en
Publication of CN113867799A publication Critical patent/CN113867799A/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3867 - Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines

Abstract

The present disclosure provides a computing device, an integrated circuit chip, a board card, and a method for performing arithmetic operations using the aforementioned computing device. The computing device may be included in a combined processing device that may also include a general-purpose interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise storage means connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme can improve the efficiency of operations in various data processing fields including, for example, the field of artificial intelligence, thereby reducing the overall overhead and cost of computation.

Description

Computing device, integrated circuit chip, board card, electronic equipment and computing method
Technical Field
The present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic apparatus, and a computing method.
Background
In computing systems, an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a critical role in improving the performance of a computing chip (e.g., a processor) in the computing system. Various types of computing chips (particularly those in the field of artificial intelligence) currently utilize associated instruction sets to perform various general or specific control operations and data processing operations. However, current instruction sets suffer from a number of drawbacks. For example, existing instruction sets are tied to their hardware architectures and perform poorly in terms of flexibility. Further, many instructions can complete only a single operation, so a task involving multiple operations often requires multiple instructions, which potentially increases on-chip I/O data throughput. In addition, current instructions leave room for improvement in execution speed, execution efficiency, and chip power consumption.
In addition, the arithmetic instructions of a conventional CPU are designed to perform basic single-data scalar arithmetic operations, where a single-data operation means an instruction whose every operand is a scalar datum. However, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional vectors (i.e., tensor data), and hardware restricted to scalar operations cannot carry out these tasks efficiently. Therefore, how to execute multidimensional tensor operations efficiently is also an urgent problem to be solved in the current computing field.
Disclosure of Invention
To solve at least the above problems in the prior art, the present disclosure provides a hardware architecture having one or more sets of pipelined arithmetic circuits supporting multi-stage pipelined operations. By utilizing this hardware architecture to execute computation instructions, aspects of the present disclosure may achieve technical advantages in a number of respects, including enhancing the processing performance of hardware, reducing power consumption, increasing the execution efficiency of computing operations, and reducing computational overhead. Further, on the basis of the aforementioned hardware architecture, the disclosed solution supports efficient access and processing of tensor data, thereby accelerating tensor operations and reducing the computational overhead they bring when computation instructions include multidimensional vector operands.
In a first aspect, the present disclosure provides a computing device comprising: one or more sets of pipelined arithmetic circuits configured to perform a plurality of stages of pipelined arithmetic operations based on a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of the pipelined arithmetic circuits constitutes a multistage arithmetic pipeline and the multistage arithmetic pipeline includes a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor and the descriptor is used to determine a storage address of data corresponding to the operand,
wherein in response to receiving a plurality of arithmetic instructions, at least one stage of arithmetic circuitry in the multi-stage arithmetic pipeline is configured to execute a corresponding one of the plurality of arithmetic instructions based on the storage address.
In a second aspect, the present disclosure provides an integrated circuit chip comprising a computing device as described above and in a number of embodiments below.
In a third aspect, the present disclosure provides a board card comprising an integrated circuit chip as described above and in the following embodiments.
In a fourth aspect, the present disclosure provides an electronic device comprising an integrated circuit chip as described above and in a number of embodiments below.
In a fifth aspect, the present disclosure provides a method of performing a calculation using the aforementioned calculation apparatus, wherein the calculation apparatus comprises one or more sets of pipelined arithmetic circuits, the method comprising: configuring each of the one or more sets of pipelined arithmetic circuits to perform a multistage pipelined operation according to a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of pipelined arithmetic circuits constitutes a multistage arithmetic pipeline including a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor, and the descriptor is used for determining a storage address of data corresponding to the operand; and in response to receiving a plurality of arithmetic instructions, configuring at least one stage of arithmetic circuitry in the multi-stage arithmetic pipeline to execute a corresponding one of the plurality of arithmetic instructions based on the storage address.
By utilizing the computing device, integrated circuit chip, board card, electronic device and method of the present disclosure, pipelined operations, particularly various multi-stage pipelined operations in the field of artificial intelligence, can be performed efficiently. Further, the disclosed scheme may implement efficient arithmetic operations by means of a unique hardware architecture, thereby improving the overall performance of the hardware and reducing computational overhead.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1a is a block diagram illustrating a computing device according to one embodiment of the present disclosure;
FIG. 1b is a schematic diagram illustrating a data storage space according to one embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a computing device according to another embodiment of the present disclosure;
FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix transformations performed by a data conversion circuit according to embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating a computing system in accordance with embodiments of the present disclosure;
FIG. 5 is a simplified flow diagram illustrating a method of performing an arithmetic operation using a computing device in accordance with an embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure; and
fig. 7 is a schematic diagram illustrating a structure of a board according to an embodiment of the disclosure.
Detailed Description
The disclosed solution provides a hardware architecture that supports multi-stage pipelined operations. When the hardware architecture is implemented in a computing device, the computing device includes at least one or more sets of pipelined arithmetic circuits, where each set of pipelined arithmetic circuits may constitute a multi-stage arithmetic pipeline of the present disclosure. In the multi-stage arithmetic pipeline, a plurality of arithmetic circuits may be arranged stage by stage. In one embodiment, when the computing device of the present disclosure performs a computing operation involving a tensor, an operand of the computation instruction may include a descriptor indicating the shape of the tensor, the descriptor being used to determine the storage address of the data corresponding to the operand. On this basis, when a plurality of operation instructions are received, at least one stage of arithmetic circuitry in the aforementioned multi-stage arithmetic pipeline may be configured to execute a corresponding one of the plurality of operation instructions according to the aforementioned storage address. By means of this hardware architecture and such operation instructions, parallel pipelined operations can be executed efficiently, the application scenarios of computation are expanded, and computational overhead is reduced.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
FIG. 1a is a block diagram illustrating a computing device 100 according to one embodiment of the present disclosure. As shown in FIG. 1a, the computing device 100 may include one or more sets of pipelined arithmetic circuits, such as the 1st set of pipelined arithmetic circuits 102, the 2nd set of pipelined arithmetic circuits 104, and the 3rd set of pipelined arithmetic circuits 106 shown, where each set of pipelined arithmetic circuits may constitute a multi-stage arithmetic pipeline in the context of this disclosure. Taking the 1st set of pipelined arithmetic circuits 102, which constitutes the 1st multi-stage arithmetic pipeline, as an example, it can execute the 1st-1st stage, 1st-2nd stage, 1st-3rd stage, ..., and 1st-Nth stage pipelined operations, totaling N stages of pipelined operations. Similarly, the 2nd and 3rd sets of pipelined arithmetic circuits also have structures supporting N-stage pipelined operations. With such an exemplary architecture, those skilled in the art will appreciate that the multiple sets of pipelined arithmetic circuits of the present disclosure may constitute multiple multi-stage arithmetic pipelines, and that these pipelines may execute respective operation instructions in parallel. In one embodiment, the aforementioned operation instructions may be obtained by parsing a computation instruction. According to aspects of the present disclosure, an operand of the computation instruction may include a descriptor indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand.
In order to perform the above stage-by-stage pipelined operations, an arithmetic circuit including one or more operators may be arranged at each stage to execute a corresponding operation instruction and thereby implement that stage's operation. When an operation instruction involves a tensor operation, at least one stage of arithmetic circuitry in the multi-stage arithmetic pipeline may be configured to execute its corresponding operation instruction based on the storage address. In one embodiment, in response to receiving multiple operation instructions, one or more sets of pipelined arithmetic circuits of the present disclosure may be configured to perform multiple data operations, such as executing single instruction, multiple data ("SIMD") instructions. In one embodiment, the aforementioned plurality of operation instructions may be obtained by parsing a computation instruction received by the computing apparatus 100, and the operation codes of the computation instruction may represent the plurality of operations performed by the multi-stage arithmetic pipeline. In another embodiment, the operation codes and the operations they represent are predetermined according to the functions supported by the plurality of arithmetic circuits arranged stage by stage in the multi-stage arithmetic pipeline.
In the solution of the present disclosure, each set of pipelined arithmetic circuits may be configured to perform stage-by-stage arithmetic operations within the multi-stage arithmetic pipeline it constitutes, and the circuits may be selectively connected according to a plurality of operation instructions to complete those instructions. In one implementation scenario, the plurality of multi-stage arithmetic pipelines of the present disclosure may include a first multi-stage arithmetic pipeline and a second multi-stage arithmetic pipeline, wherein the output of one or more stages of arithmetic circuitry of the first pipeline is configured, according to the operation instruction, to be connected to the input of one or more stages of arithmetic circuitry of the second pipeline. For example, the 1st-2nd stage pipelined operation in the 1st multi-stage arithmetic pipeline shown in the figure may input its operation result into the 2nd-3rd stage pipelined operation in the 2nd multi-stage arithmetic pipeline according to the operation instruction. Similarly, the 2nd-1st stage pipelined operation in the 2nd pipeline may input its operation result to the 3rd-3rd stage pipelined operation in the 3rd pipeline according to the operation instruction. In some scenarios, depending on the operation instructions, two pipeline stages in different pipelines may implement bidirectional transfer of operation results, e.g., between the 2nd-2nd stage pipelined operation in the 2nd multi-stage arithmetic pipeline and the 3rd-2nd stage pipelined operation in the 3rd multi-stage arithmetic pipeline as shown.
From the foregoing, it can be seen that, in order to transfer data both within the same arithmetic pipeline and between different arithmetic pipelines, each stage of arithmetic circuitry in the disclosed multiple sets of arithmetic pipelines may have an input end for receiving input data and an output end for outputting the result of that stage's operation. In a multi-stage arithmetic pipeline, the output end of one or more stages of arithmetic circuitry is configured, according to an operation instruction, to be connected to the input end of one or more other stages in order to execute the operation instruction; for example, in the 1st arithmetic pipeline, the result of the 1st-1st stage pipelined operation can be input to the 1st-3rd stage pipelined operation of that pipeline according to the operation instruction.
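As a rough software analogue of such configurable connections, the following Python sketch routes one stage's output either down its own pipeline or into a stage of another pipeline, mirroring the 1-2 to 2-3 hand-off described above. All names (Stage, run_pipeline, route) are illustrative; the patent describes hardware circuits, not software.

```python
# Minimal model of selectively connectable pipeline stages.
# All names here are illustrative, not from the patent.

class Stage:
    def __init__(self, name, fn):
        self.name = name   # e.g. "1-2" for stage 2 of pipeline 1
        self.fn = fn       # the operator this stage's circuit implements

    def run(self, x):
        return self.fn(x)

def run_pipeline(stages, x, route=None):
    """Run stages in order; `route` may redirect a stage's output to a
    stage of another pipeline, mimicking the cross-pipeline connections
    configured by an operation instruction."""
    for stage in stages:
        x = stage.run(x)
        if route and stage.name in route:
            x = route[stage.name](x)   # hand off to the other pipeline's stage
    return x

# Example: stage 1-2's result is also consumed by stage 2-3.
p1 = [Stage("1-1", lambda v: v + 1), Stage("1-2", lambda v: v * 2)]
stage_2_3 = Stage("2-3", lambda v: v - 3)
print(run_pipeline(p1, 10, route={"1-2": stage_2_3.run}))  # ((10+1)*2)-3 = 19
```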
In the context of the present disclosure, the aforementioned operation instructions may be microinstructions or control signals executed within a computing device (or processing circuit, or processor), and may include (or indicate) one or more arithmetic operations to be executed by the computing device. Depending on the operation scenario, the arithmetic operations may include, but are not limited to, addition, multiplication, convolution, pooling, and other operations. To implement multi-stage pipelined operations, each stage's arithmetic circuitry may include, but is not limited to, one or more of the following operators or circuits: a random number processing circuit, an addition-subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a pooling device, a comparator, an absolute value circuit, a logic operator, a position index circuit, or a filter. Taking the pooling device as an example, it may be constituted by an adder, a divider, a comparator, and the like, so as to perform pooling operations in a neural network.
In order to realize multi-stage pipelined operations, the present disclosure may further provide corresponding computation instructions according to the operations supported by the arithmetic circuits in the pipeline. Depending on the operation scenario, a computation instruction of the present disclosure may include a plurality of operation codes, which may represent a plurality of operations performed by the arithmetic circuits. For example, when N = 4 in FIG. 1a (i.e., when performing a 4-stage pipelined operation), a computation instruction according to the disclosed aspect may be represented by the following equation (1):
Result = ((((src0 op0 src1) op1 src2) op2 src3) op3 src4)    (1)
where src0 to src4 are source operands, and op0 to op3 are operation codes. The types, order, and number of opcodes in the disclosed computation instructions may vary according to different pipelined arithmetic circuit architectures and the operations they support. In one embodiment, when the computing operation involves a tensor, one of the above source operands may include a descriptor indicating the shape of the tensor, so that the descriptor may be used to determine the storage address of the data corresponding to the operand.
In some application scenarios, the multi-stage pipelined operation of the present disclosure may support unary operations (i.e., cases with only one input datum). Taking the operation of a scale layer + relu layer in a neural network as an example, assume that the computation instruction to be executed is expressed as result = relu(a * ina + b), where ina is the input data (which may, for example, be a vector or a matrix) and a and b are operation constants. For this computation instruction, a set of three-stage pipelined arithmetic circuits of the present disclosure comprising a multiplier, an adder, and a nonlinear operator may be applied. Specifically, the multiplier of the first pipeline stage computes the product of the input data ina and a to obtain the first-stage result. The adder of the second pipeline stage then adds the first-stage result (a * ina) and b to obtain the second-stage result. Finally, the relu activation function of the third pipeline stage activates the second-stage result (a * ina + b) to obtain the final operation result.
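A software analogue of this three-stage pipeline, written as a minimal Python sketch with numpy (illustrative only; the patent performs these stages in dedicated circuits, and the input values below are arbitrary examples):

```python
import numpy as np

# Sketch of the three-stage pipeline result = relu(a * ina + b).

def relu(x):
    return np.maximum(x, 0.0)

ina = np.array([-2.0, 0.5, 3.0])   # example input (a vector here)
a, b = 2.0, 1.0                     # operation constants

s1 = a * ina          # stage 1: multiplier
s2 = s1 + b           # stage 2: adder
result = relu(s2)     # stage 3: nonlinear operator (relu activation)
print(result)         # [0. 2. 7.]
```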
In some application scenarios, the multi-stage pipelined arithmetic circuits of the present disclosure may support binary operations (e.g., the convolution instruction result = conv(ina, inb)) or ternary operations (e.g., the convolution instruction result = conv(ina, inb, bias)), where the input data ina, inb, and bias may be vectors (e.g., integer, fixed-point, or floating-point data) or matrices. Taking the convolution computation instruction result = conv(ina, inb) as an example, the convolution operation expressed by the computation instruction may be performed using the plurality of multipliers, at least one addition tree, and at least one nonlinear operator included in a three-stage pipelined arithmetic circuit structure, where the two input data ina and inb may, for example, be neuron data. Specifically, the first-stage pipeline multipliers of the three-stage pipelined arithmetic circuit first compute product = ina * inb (regarded as one microinstruction of the operation instruction, corresponding to a multiplication operation). The addition tree of the second-stage pipelined arithmetic circuit then sums the first-stage result product to obtain the second-stage result sum. Finally, the nonlinear operator of the third-stage pipelined arithmetic circuit activates sum, thereby obtaining the final convolution result.
In some application scenarios, as described above, the solution of the present disclosure may bypass one or more stages of pipelined arithmetic circuits that will not be used in an arithmetic operation; that is, pipeline stages may be used selectively according to the needs of the operation, without routing the operation through all pipeline stages. Taking the computation of a Euclidean distance as an example, assuming the computation instruction is expressed as dis = sum((ina - inb)^2), the operation can be performed using only the pipeline stages composed of adders, multipliers, addition trees, and accumulators to obtain the final result, and the unused pipelined arithmetic circuits can be bypassed before or during the pipelined operation.
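The bypass idea can be sketched in Python as selectively replacing unused stages by an identity pass-through (an illustrative model, not the patent's circuit-level mechanism; the `maybe` helper and input values are assumptions for the example):

```python
import numpy as np

# Sketch of stage bypassing for dis = sum((ina - inb)^2):
# only subtract, multiply (square), and accumulate stages are used;
# a nonlinear/activation stage is bypassed (identity pass-through).

def maybe(stage_fn, enabled):
    return stage_fn if enabled else (lambda x: x)   # bypass = pass-through

ina = np.array([1.0, 4.0, 2.0])
inb = np.array([3.0, 1.0, 2.0])

diff = ina - inb                           # adder/subtractor stage
sq = maybe(lambda x: x * x, True)(diff)    # multiplier stage (used)
act = maybe(np.tanh, False)(sq)            # nonlinear stage (bypassed)
dis = np.sum(act)                          # accumulator stage
print(dis)  # (1-3)^2 + (4-1)^2 + 0 = 13.0
```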
As previously described, the multi-stage pipelined operation of the present disclosure further includes using descriptors to obtain tensor-shape-related information in order to determine the storage addresses of tensor data, so that the tensor data can be fetched and saved through those storage addresses.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, where N is a positive integer, e.g., N = 1, 2, or 3, or N may be zero. A tensor can take various forms of data composition and may have different dimensions: for example, a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a 2-dimensional tensor, while tensors may also have more than 2 dimensions. The shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for a tensor:
[tensor example shown in the original publication as an image: a two-dimensional tensor of shape (2, 4)]
the shape of the tensor can be described by the descriptor as (2, 4); that is, two parameters represent it as a two-dimensional tensor whose first dimension has size 2 and whose second dimension has size 4. It should be noted that the present application does not limit the manner in which a descriptor indicates the tensor shape.
In one possible implementation, the value of N may be determined according to the dimension (order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier is used to distinguish descriptors; for example, it may be the descriptor's number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier, the content, and the tensor data indicated by the descriptor may all be stored in the same block of internal memory. For example, a contiguous block of on-chip cache at addresses ADDR0-ADDR1023 may be used to store the descriptor-related content: addresses ADDR0-ADDR63 can serve as the descriptor storage space, storing the identifier and content of the descriptor, while addresses ADDR64-ADDR1023 can serve as the data storage space, storing the tensor data indicated by the descriptor. Within the descriptor storage space, the identifiers of descriptors may be stored at addresses ADDR0-ADDR31 and their contents at addresses ADDR32-ADDR63. It should be understood that ADDR here is neither limited to one bit nor to one byte; it denotes one address, i.e., one addressing unit. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited thereto.
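As an illustration of this layout, the following Python sketch fixes the three regions as constants taken from the text; the helper store_descriptor and the per-slot content layout are hypothetical, not specified by the patent:

```python
# Illustrative layout of the on-chip cache region described above
# (addresses are abstract units, exactly as in the text).

DESC_ID_SPACE      = range(0, 32)      # ADDR0-ADDR31: descriptor identifiers
DESC_CONTENT_SPACE = range(32, 64)     # ADDR32-ADDR63: descriptor contents
TENSOR_DATA_SPACE  = range(64, 1024)   # ADDR64-ADDR1023: tensor data

cache = [0] * 1024                     # the contiguous block ADDR0-ADDR1023

def store_descriptor(slot, ident, content_words):
    """Hypothetical helper: place a descriptor's identifier and content
    words in their regions; `slot` indexes descriptors, not addresses,
    and a fixed content size per slot is assumed."""
    cache[DESC_ID_SPACE.start + slot] = ident
    base = DESC_CONTENT_SPACE.start + slot * len(content_words)
    cache[base: base + len(content_words)] = content_words

store_descriptor(0, ident=7, content_words=[2, 4])  # e.g. a (2, 4) tensor shape
```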
In one possible implementation, the identifier and content of the descriptor and the tensor data indicated by the descriptor may be stored in different areas of internal memory. For example, a register may be used as the descriptor storage space to store the identifier and content of the descriptor, and an on-chip cache may be used as the data storage space to store the tensor data indicated by the descriptor.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area in the buffer space can be allocated for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identifier and content of the descriptor may be stored in an internal memory, and the tensor data indicated by the descriptor may be stored in an external memory. For example, the identifier and content of the descriptor may be stored on-chip, while the tensor data indicated by the descriptor is stored off-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be allocated for each piece of tensor data, with the start address of each space corresponding one-to-one to a descriptor. In this case, a control circuit may determine, based on the descriptor, the data address in the data storage space of the data corresponding to the operand.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may also be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter indicating that address. For example, if the tensor data is 3-dimensional and the descriptor points to its address, the content of the descriptor may include one address parameter indicating the address of the tensor data (such as its starting physical address) or multiple address parameters (such as the starting address of the tensor data plus an address offset, or address parameters for each dimension). The address parameters can be set by those skilled in the art according to practical needs, which the present disclosure does not limit.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a variation of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the reference address may include the start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is some datum other than the first data block, the reference address of the descriptor is the address of that datum in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one of N dimensional directions, the size of the storage area in at least one of N dimensional directions, the offset of the storage area in at least one of N dimensional directions, the positions of at least two vertices located at diagonal positions in the N dimensional directions relative to the data reference point, and the mapping relationship between the data description positions of tensor data indicated by the descriptors and the data addresses. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
In one possible implementation, the content of the descriptor of the tensor data may be determined according to a reference address of a data reference point of the descriptor in a data storage space of the tensor data, a size of the data storage space in at least one of N dimensional directions, a size of the storage area in at least one of the N dimensional directions, and/or an offset of the storage area in at least one of the N dimensional directions.
FIG. 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1b, the data storage space 21 stores two-dimensional data in a row-first manner and can be represented by (X, Y) (where the X axis points horizontally to the right and the Y axis vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the starting address PA_start (the reference address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 may then be determined from the size ori_x of the data storage space 21 along the X axis and the size ori_y along the Y axis, together with the offset offset_y of the data block 23 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following equation (2):
{ori_x, ori_y, offset_x, offset_y, size_x, size_y}    (2)
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation, a reference address of the descriptor's data reference point in the data storage space may be agreed upon, and, based on this reference address, the content of the descriptor of the tensor data may be determined from the positions, relative to the data reference point, of at least two vertices located at diagonal positions in the N dimensional directions.
For example, a reference address PA_base of the descriptor's data reference point in the data storage space may be agreed upon. One datum (for example, the datum at position (2, 2)) may be selected in the data storage space 21 as the data reference point, and its physical address in the data storage space used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1b can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example using the diagonal vertices in the top-left-to-bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following equation (3):

{(x_min, y_min), (x_max, y_max)}    (3)
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation manner, the content of the descriptor of the tensor data can be determined according to a reference address of the data reference point of the descriptor in the data storage space and a mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (4):
{f(x, y, z)}    (4)
in one possible implementation, the descriptor is further configured to indicate an address of the N-dimensional tensor data, where the content of the descriptor further includes at least one address parameter indicating the address of the tensor data, for example, the content of the descriptor may be:
{ori_x, ori_y, offset_x, offset_y, size_x, size_y, PA}
where PA is the address parameter, which may be a logical address or a physical address. The descriptor parsing circuit may obtain the corresponding data address by taking PA as any one of a vertex, a middle point, or a preset point of the tensor shape, in combination with the shape parameters in the X and Y directions.
In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address includes a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be:
{ori_x, ori_y, offset_x, offset_y, size_x, size_y, PA_start}
where PA_start is the reference address parameter, which has been described above and is not repeated here.
It should be understood that the mapping relationship between the data description position and the data address can be set by those skilled in the art according to practical situations, which the present disclosure does not limit.
In a possible implementation, a default base address can be set for a task; descriptors in the instructions of the task use this base address, and the descriptor content may then include shape parameters based on it. This base address may be determined by setting an environment parameter for the task. The relevant description and usage of the base address can be found in the embodiments above. In this implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may instead be included in the content of each descriptor, and the reference addresses of different descriptors may differ. Compared with setting a common reference address via an environment parameter, this approach lets each descriptor describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.
For example, suppose the content of the descriptor in the operand is expressed by equation (2), with the tensor data indicated by the descriptor offset in the data storage space by offset_x and offset_y respectively and of size size_x × size_y. Then the starting data address PA1(x,y) of that tensor data in the data storage space may be determined using the following equation (5):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (5)

Combining the data start address PA1(x,y) determined by equation (5) with the offsets offset_x and offset_y and the storage-area sizes size_x and size_y, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.
For example, suppose the content of the descriptor in the operand is expressed by equation (2), the tensor data indicated by the descriptor is offset in the data storage space by offset_x and offset_y respectively, its size is size_x × size_y, and the operand includes a data description position (xq, yq) for the descriptor. Then the data address PA2(x,y) of the data corresponding to the operand in the data storage space may be determined using the following equation (6):

PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)    (6)
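The two address calculations above are simple enough to check in software. Below is a minimal Python sketch of equations (5) and (6); the function names and the example values are illustrative, not from the patent:

```python
# Sketch of the address calculations of equations (5) and (6);
# PA_start is the agreed reference (start) address of the data storage space.

def pa1_start(PA_start, ori_x, offset_x, offset_y):
    """Equation (5): starting address of the tensor data block."""
    return PA_start + (offset_y - 1) * ori_x + offset_x

def pa2_element(PA_start, ori_x, offset_x, offset_y, xq, yq):
    """Equation (6): address of the datum at description position (xq, yq)."""
    return PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)

base = 0x1000
print(hex(pa1_start(base, ori_x=8, offset_x=2, offset_y=1)))   # 0x1002
print(hex(pa2_element(base, 8, 2, 1, xq=1, yq=2)))             # 0x1013
```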
the computing device of the present disclosure is described above with reference to fig. 1a and 1b, and by using one or more sets of pipelined arithmetic circuits in the computing device of the present disclosure, a computing instruction can be efficiently executed on the computing device to complete a multi-stage pipelined arithmetic operation, thereby improving the efficiency of computing execution and reducing the overhead of computation. In addition, by utilizing the calculation instruction to perform the operation on the tensor, the scheme of the disclosure also obviously improves the access and processing efficiency of tensor data and reduces the overhead on tensor operation.
FIG. 2 is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen in the figure, in addition to having two sets of pipelined arithmetic circuits 102 and 104 identical to those of computing device 100, computing device 200 additionally includes a control circuit 202 and a data processing unit 204. In one embodiment, the control circuit 202 may be configured to obtain the above-mentioned computation instruction and parse it into the plurality of operation instructions corresponding to the plurality of operations represented by its operation codes, for example as expressed in equation (1).
In one embodiment, the data processing unit 204 may include a data conversion circuit 206 and a data splicing circuit 208. When a computation instruction includes a pre-processing operation for a pipelined arithmetic operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 206 or the data splicing circuit 208 performs the corresponding conversion or splicing operation according to the instruction. The conversion and splicing operations are described below by way of example.
In terms of data conversion, when the bit width of the data input to the data conversion circuit is high (for example, 1024 bits), the data conversion circuit may convert the input data into data of a lower bit width (for example, 512 bits) according to operational requirements. Depending on the application scenario, the data conversion circuit can support conversion among multiple data types, for example among data types of different bit widths such as FP16 (16-bit floating point), FP32 (32-bit floating point), FIX8 (8-bit fixed point), FIX4 (4-bit fixed point), and FIX16 (16-bit fixed point). When the data input to the data conversion circuit is a matrix, the data conversion operation may be a transformation of the arrangement positions of the matrix elements. Such transformations may include, for example, matrix transposition and mirroring (described later in connection with FIGS. 3a-3c), rotation of the matrix by a predetermined angle (e.g., 90, 180, or 270 degrees), and transformation of the matrix dimensions.
As for the data splicing operation, the data splicing circuit may, for example, split data into blocks according to a bit length set in an instruction and then splice the blocks in an odd-even fashion. For example, when the data is 32 bits wide, the data splicing circuit may divide it by a 4-bit width into eight data blocks numbered 1 through 8, then splice data blocks 1, 3, 5, and 7 together and splice data blocks 2, 4, 6, and 8 together for the operation.
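A Python sketch of this parity splicing follows; it assumes that block 1 is the lowest-order 4-bit block, an ordering the text does not pin down:

```python
# Sketch of the parity splicing example: a 32-bit word is cut into
# eight 4-bit blocks (block 1 assumed to be the lowest 4 bits), then
# blocks 1,3,5,7 and blocks 2,4,6,8 are each spliced into 16 bits.

def parity_splice(word32):
    blocks = [(word32 >> (4 * i)) & 0xF for i in range(8)]  # blocks 1..8
    odd  = blocks[0::2]   # blocks 1, 3, 5, 7
    even = blocks[1::2]   # blocks 2, 4, 6, 8
    pack = lambda bs: sum(b << (4 * i) for i, b in enumerate(bs))
    return pack(odd), pack(even)

odd16, even16 = parity_splice(0x87654321)
print(hex(odd16), hex(even16))  # 0x7531 0x8642
```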
In other application scenarios, the data splicing operation may also be performed on data M (which may, for example, be a vector) obtained after an operation. Suppose the data splicing circuit splits the lower 256 bits of each even row of M, using 8 bits as one unit, into 32 even-row unit data (denoted M_2i_0 through M_2i_31). Similarly, the lower 256 bits of each odd row of M may be split, with an 8-bit width as one unit, into 32 odd-row unit data (denoted M_(2i+1)_0 through M_(2i+1)_31). The 32 split odd-row unit data and 32 even-row unit data are then arranged alternately in order from low bits to high bits, even row first and odd row second. Specifically, even-row unit data 0 (M_2i_0) is placed at the lowest position, followed by odd-row unit data 0 (M_(2i+1)_0), then even-row unit data 1 (M_2i_1), and so on. When odd-row unit data 31 (M_(2i+1)_31) has been placed, the 64 unit data have been spliced into a new 512-bit-wide datum.
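The interleaving just described can likewise be sketched in Python (illustrative only; the row values below are arbitrary small examples):

```python
# Sketch of the interleaving example: the low 256 bits of an even row
# and an odd row are each split into 32 units of 8 bits, then merged
# alternately (even-row unit first, odd-row unit second) into 512 bits.

def interleave_rows(even_row, odd_row):
    mask = (1 << 256) - 1
    ev = [((even_row & mask) >> (8 * i)) & 0xFF for i in range(32)]
    od = [((odd_row  & mask) >> (8 * i)) & 0xFF for i in range(32)]
    out = 0
    for i in range(32):                 # low units placed first
        out |= ev[i] << (16 * i)        # even-row unit i
        out |= od[i] << (16 * i + 8)    # odd-row unit i follows it
    return out                           # a 512-bit result

print(hex(interleave_rows(0x0201, 0x0403)))  # 0x4020301
```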
Depending on the application scenario, the data conversion circuit and the data splicing circuit in the data processing unit can be used in combination, so that pre-processing and post-processing of data can be performed more flexibly. For example, depending on the operations included in a computation instruction, the data processing unit may perform only data conversion without data splicing, only data splicing without data conversion, or both. In some scenarios, when the computation instruction includes no pre-processing operations for the pipelined arithmetic operation, the data processing unit may be configured to disable the data conversion circuit and the data splicing circuit. In other scenarios, when the computation instruction includes post-processing operations for the pipelined arithmetic operation, the data processing unit may be configured to enable the data conversion circuit and the data splicing circuit to post-process the intermediate result data so as to obtain the final operation result.
To implement data storage operations, computing device 200 also includes storage circuitry 210. In one implementation scenario, the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store data for performing a multi-stage pipeline operation and an operation result after performing the operation, and the main cache module is configured to cache an intermediate operation result after performing the operation in the multi-stage pipeline operation. Further, the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data transfer between on-chip and off-chip systems may be achieved. When performing the tensor operation, the storage circuit 210 may acquire operand-corresponding data from the storage address determined by using the aforementioned descriptor, and store the resultant data to the corresponding storage address by using the descriptor after the tensor operation is finished.
FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix transformations performed by the data conversion circuit according to embodiments of the present disclosure. To better understand the conversion operations performed by the data conversion circuit 206, the transpose operation and the horizontal mirror operation performed on an original matrix are further described below as examples.
As shown in FIG. 3a, the original matrix is a matrix of (M+1) rows × (N+1) columns. Depending on the requirements of the application scenario, the data conversion circuit may perform a transpose operation on the original matrix of FIG. 3a to obtain the matrix shown in FIG. 3b. Specifically, the data conversion circuit swaps the row and column indices of the elements of the original matrix to form the transposed matrix. For example, the element "10" at row 1, column 0 of the original matrix in FIG. 3a appears at row 0, column 1 of the transposed matrix in FIG. 3b. By analogy, the element "M0" at row M, column 0 of the original matrix in FIG. 3a appears at row 0, column M of the transposed matrix in FIG. 3b.
As shown in FIG. 3c, the data conversion circuit may also perform a horizontal mirroring operation on the original matrix of FIG. 3a to form a horizontally mirrored matrix. Specifically, the horizontal mirroring operation reverses the order of the rows, from first-to-last into last-to-first, while keeping the column indices of the elements of the original matrix unchanged. For example, the element "00" at row 0, column 0 and the element "10" at row 1, column 0 of the original matrix in FIG. 3a appear at row M, column 0 and row M-1, column 0, respectively, of the horizontally mirrored matrix in FIG. 3c. By analogy, the element "M0" at row M, column 0 of the original matrix in FIG. 3a appears at row 0, column 0 of the horizontally mirrored matrix in FIG. 3c.
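Both conversions have direct numpy analogues, shown in the following illustrative sketch (the example matrix values stand in for the labeled elements of FIGS. 3a-3c):

```python
import numpy as np

# Numpy analogue of the two conversions of FIGS. 3a-3c:
# transpose swaps row/column indices; horizontal mirroring
# reverses the row order while columns stay unchanged.

original = np.array([[0, 1, 2],
                     [10, 11, 12],
                     [20, 21, 22]])

transposed = original.T            # element at (1, 0) moves to (0, 1), etc.
h_mirrored = np.flipud(original)   # last row becomes first, columns unchanged

print(transposed[0, 1])   # 10 -- was at row 1, column 0
print(h_mirrored[0])      # [20 21 22] -- the former last row
```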
Based on the hardware architecture described above in connection with FIG. 2, the computing device of the present disclosure may execute computation instructions that include the aforementioned pre-processing and post-processing. Two illustrative examples of computation instructions according to aspects of the present disclosure are given below:
example 1: MUAD ═ FPMULT) + (FPADD/FPSUB) + (RELU) + (CONVERTFP2FIX) (7)
The computation instruction expressed in equation (7) above takes three input operands and produces one output operand, and it comprises microinstructions that can be completed by a set of pipelined arithmetic circuits according to the present disclosure spanning three stages of pipelined operations (i.e., multiply + add/subtract + activate) followed by a type conversion. Specifically, the ternary operation is A * B + C. The FPMULT microinstruction performs the floating-point multiplication of operands A and B to obtain a product value, i.e., the first-stage pipelined operation. The FPADD or FPSUB microinstruction then performs the floating-point addition or subtraction of the product value and C to obtain the sum or difference, i.e., the second-stage pipelined operation. The activation operation RELU, i.e., the third-stage pipelined operation, may then be performed on the previous stage's result. After the three-stage pipelined operation, the CONVERTFP2FIX microinstruction may finally be executed by the aforementioned type conversion circuit to convert the result of the activation operation from a floating-point number to a fixed-point number, to be output as the final result or passed as an intermediate result to a fixed-point operator for further computation.
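A Python sketch of this four-step sequence follows; the FIX8 conversion shown (scale by 2^4 and round with saturation) is an assumed concrete scheme, since the patent does not specify the fixed-point format:

```python
import numpy as np

# Sketch of the MUAD instruction of equation (7): A * B + C, relu,
# then float-to-fixed conversion (assumed 4 fractional bits).

def convert_fp2fix(x, frac_bits=4):
    return np.clip(np.round(x * (1 << frac_bits)), -128, 127).astype(np.int8)

A = np.array([0.5, -1.0, 2.0], dtype=np.float32)
B = np.array([2.0,  3.0, 1.5], dtype=np.float32)
C = np.array([0.25, 1.0, -4.0], dtype=np.float32)

s1 = A * B                      # FPMULT: floating-point multiply
s2 = s1 + C                     # FPADD: floating-point add
s3 = np.maximum(s2, 0.0)        # RELU: activation
out = convert_fp2fix(s3)        # CONVERTFP2FIX: type conversion
print(out)  # [20  0  0] with 4 fractional bits
```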
Example 2: SECUCDODC = SEARCH + MULT + ADD    (8)
The computation instruction expressed in equation (8) above takes three input operands and produces one output operand, and it comprises microinstructions that can be completed by a set of pipelined arithmetic circuits according to the present disclosure spanning three stages of pipelined operations (i.e., table lookup + multiply + add). Specifically, the ternary operation is ST(A) * B + C. The SEARCH microinstruction may be performed by the table lookup circuitry in the first pipeline stage to obtain the lookup result ST(A). The multiplication of the lookup result and operand B is then completed by the second pipeline stage to obtain the product value. Finally, the ADD microinstruction performs the addition of the product value and C to obtain the sum, i.e., the third-stage pipelined operation.
As described above, the computation instructions of the present disclosure can be flexibly designed and determined according to computational requirements. Accordingly, the hardware architecture of the present disclosure, which includes a plurality of arithmetic pipelines, can be configured and interconnected according to a computation instruction and the plurality of microinstructions (or micro-operations) it contains, so that a plurality of arithmetic operations can be completed by a single computation instruction, thereby improving instruction execution efficiency and reducing computational overhead.
FIG. 4 is a block diagram illustrating a computing system 400 according to an embodiment of the present disclosure. As can be seen in the figure, in addition to the computing device 200, the computing system further comprises a plurality of slave processing circuits 402 and an interconnect unit 404 for connecting the computing device 200 and the plurality of slave processing circuits 402.
In one operational scenario, the slave processing circuits of the present disclosure may, according to a computation instruction (implemented, for example, as one or more microinstructions or control signals), operate on data that has undergone a pre-processing operation in the computing device, so as to obtain a desired operation result. In another operational scenario, a slave processing circuit may send the intermediate result obtained from its operation (e.g., via the interconnect unit 404) to a data processing unit in the computing device, so that a data conversion circuit in the data processing unit performs data type conversion on the intermediate result, or a data splicing circuit in the data processing unit performs data splitting and splicing operations on it, thereby obtaining a final operation result.
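As a rough functional illustration of the splitting and splicing just mentioned (the two-byte block length and the last-to-first splicing order are assumptions; both are described here only as predetermined, hence configurable), consider:

    def split_and_splice(data: bytes, block_len: int = 2) -> bytes:
        # Split the data into fixed-length blocks, then splice the blocks
        # back together in a predetermined (here: reversed) order.
        blocks = [data[i:i + block_len] for i in range(0, len(data), block_len)]
        return b"".join(reversed(blocks))

    # Example: split_and_splice(b"ABCDEF") -> b"EFCDAB"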
FIG. 5 is a simplified flowchart illustrating a method 500 of performing an arithmetic operation using a computing device according to an embodiment of the present disclosure. From the foregoing description, it will be appreciated that the computing device here may be the computing device described above in connection with fig. 1 (including fig. 1a and 1b) through fig. 4, which has the internal connections illustrated and supports the various classes of operations described herein.
As shown in fig. 5, at step 502, the method 500 configures each of the one or more sets of pipelined arithmetic circuits to perform a multistage pipelined operation according to a plurality of operation instructions obtained by parsing a computation instruction, wherein each set of pipelined arithmetic circuits constitutes a multistage arithmetic pipeline including a plurality of arithmetic circuits arranged stage by stage, wherein an operand of the computation instruction includes a descriptor indicating the shape of a tensor, and the descriptor is used to determine a storage address of the data corresponding to the operand. Next, at step 504, in response to receiving the plurality of operation instructions, the method 500 configures at least one stage of arithmetic circuitry in the multistage arithmetic pipeline to execute a corresponding one of the plurality of operation instructions based on the storage address.
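Purely as a conceptual sketch of steps 502 and 504 (all class, field, and function names below are invented for illustration; the present disclosure defines hardware circuits, not a software API), the control flow may be modeled as:

    from dataclasses import dataclass

    @dataclass
    class Descriptor:
        base_addr: int    # reference address in the data storage space
        shape: tuple      # tensor shape parameters
        strides: tuple    # per-dimension strides, in elements

        def address_of(self, index: tuple) -> int:
            # Step 502: the descriptor determines the storage address
            # of the data corresponding to the operand.
            assert all(0 <= i < d for i, d in zip(index, self.shape))
            return self.base_addr + sum(i * s for i, s in zip(index, self.strides))

    def run_pipeline(stages, desc: Descriptor, index: tuple, memory: dict):
        # Step 504: each stage of arithmetic circuitry executes its operation
        # instruction on data fetched from the descriptor-derived address.
        value = memory[desc.address_of(index)]
        for stage in stages:
            value = stage(value)
        return value

    # Example: a two-stage pipeline (multiply by 2, then ReLU) on element (1, 0)
    desc = Descriptor(base_addr=100, shape=(2, 3), strides=(3, 1))
    memory = {103: 2.5}
    result = run_pipeline([lambda x: 2 * x, lambda x: max(0.0, x)], desc, (1, 0), memory)
    # address 100 + 1*3 + 0*1 = 103; value 2.5 -> 5.0 -> relu -> 5.0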
For the sake of simplicity, the calculation method of the present disclosure is described above only in conjunction with fig. 5. Those skilled in the art will appreciate that, in light of the present disclosure, the method may include more steps, whose execution may implement the various operations of the present disclosure described above in conjunction with figs. 1 to 4; these steps are not described in detail here again.
Fig. 6 is a block diagram illustrating a combined processing device 600 according to an embodiment of the present disclosure. As shown in fig. 6, the combined processing device 600 includes a computing processing device 602, an interface device 604, other processing devices 606, and a storage device 608. Depending on the application scenario, one or more computing devices 610 may be included in the computing processing device and may be configured to perform the operations described herein in conjunction with fig. 1-5.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within the computing processing device may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of such cores, the computing processing device of the present disclosure may be considered to have a single-core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to jointly complete user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and artificial intelligence processors. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure alone may be considered to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be considered to form a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device associated with artificial intelligence operations such as neural network operations) and external data and controls, performing basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from a storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained within the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 702 shown in fig. 7). In one implementation, the chip is a system on chip (SoC) integrating one or more combined processing devices as shown in fig. 6. The chip may be connected to other associated components through an external interface device, such as the external interface device 706 shown in fig. 7. The associated components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip packaging structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip packaging structure. The board card will be described in detail below with reference to fig. 7.
Fig. 7 is a schematic diagram illustrating the structure of a board card 700 according to an embodiment of the disclosure. As shown in fig. 7, the board card includes a storage device 704 for storing data, which includes one or more storage units 710. The storage device may be connected to the control device 708 and the above-described chip 702, for example by a bus, for data transfer. Further, the board card includes an external interface device 706 configured for a data relay or transfer function between the chip (or the chip in the chip packaging structure) and an external device 712 (such as a server or a computer). For example, data to be processed may be transferred to the chip by the external device through the external interface device. For another example, a calculation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may take different interface forms; for example, it may adopt a standard PCIE interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a single-chip microcomputer (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 6 and 7, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. On this basis, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The memory may include, but is not limited to, a USB flash disk, a flash memory, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In other implementation scenarios, the integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
clause 1, a computing device, comprising:
one or more sets of pipelined arithmetic circuits configured to perform a plurality of stages of pipelined arithmetic operations based on a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of the pipelined arithmetic circuits constitutes a multistage arithmetic pipeline and the multistage arithmetic pipeline includes a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor and the descriptor is used to determine a storage address of data corresponding to the operand,
wherein in response to receiving the plurality of operation instructions, at least one stage of operation circuitry in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions based on the storage address.
Clause 2, the computing device of clause 1, wherein the opcode of the computation instruction represents a plurality of operations performed by the multi-stage operation pipeline, the computing device further comprising control circuitry configured to fetch and parse the computation instruction to obtain the plurality of operation instructions corresponding to the plurality of operations, and in the parsing, the control circuitry is further configured to determine a storage address for data corresponding to the operand from the descriptor.
Clause 3, the computing apparatus of clause 2, wherein the computing instruction comprises an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of the tensor data.
Clause 4, the computing device of clause 3, wherein the contents of the descriptor further include at least one address parameter representing an address of tensor data.
Clause 5, the computing device of clause 4, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
Clause 6, the computing device of clause 5, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
Clause 7, the computing device of clause 2, wherein the opcode and the plurality of operations it represents are predetermined in accordance with functions supported by a plurality of arithmetic circuits arranged stage by stage in a multi-stage arithmetic pipeline.
Clause 8, the computing device of clause 1, wherein each stage of arithmetic circuitry in the multi-stage arithmetic pipeline is configured to be selectively connected in accordance with the plurality of arithmetic instructions in order to execute the plurality of arithmetic instructions.
Clause 9, the computing apparatus of clause 1, wherein the plurality of sets of pipelined arithmetic circuits form a plurality of multistage arithmetic pipelines, and the plurality of multistage arithmetic pipelines execute respective pluralities of arithmetic instructions in parallel.
Clause 10, the computing apparatus of clause 1 or 9, wherein each stage of operational circuitry in the multi-stage operational pipeline has an input and an output for receiving input data at the stage of operational circuitry and outputting results of the operation of the stage of operational circuitry.
Clause 11, the computing device according to clause 10, wherein, in a multi-stage arithmetic pipeline, the output of the arithmetic circuit of one or more stages is configured to be connected to the input of the arithmetic circuit of another stage or stages according to the arithmetic instruction to execute the arithmetic instruction.
Clause 12, the computing device of clause 10, wherein the plurality of multi-stage operation pipelines includes a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein an output of the operational circuitry of one or more stages of the first multi-stage operation pipeline is configured to be connected to an input of the operational circuitry of one or more stages of the second multi-stage operation pipeline in accordance with the operational instruction.
Clause 13, the computing device of clause 1, wherein each stage of arithmetic circuitry comprises one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a table look-up circuit, a parameter configuration circuit, a multiplier, a pooling device, a comparator, an absolute value calculating circuit, a logic arithmetic device, a position index circuit or a filter.
Clause 14, the computing device of clause 1, further comprising data processing circuitry comprising type conversion circuitry for performing a data type conversion operation and/or data splicing circuitry for performing a data splicing operation.
Clause 15, the computing apparatus of clause 14, wherein the type conversion circuitry comprises one or more converters for effecting conversion of the computing data between the plurality of different data types.
Clause 16, the computing device of clause 14, wherein the data splicing circuit is configured to split the computing data in a predetermined bit length and splice a plurality of data blocks obtained after the splitting in a predetermined order.
Clause 17, an integrated circuit chip comprising the computing device of any one of clauses 1-16.
Clause 18, a board comprising the integrated circuit chip of clause 17.
Clause 19, an electronic device, comprising the integrated circuit chip of clause 17.
Clause 20, a method of performing a computing operation using a computing device, wherein the computing device includes one or more sets of pipelined arithmetic circuits, the method comprising:
configuring each of the one or more sets of pipelined arithmetic circuits to perform a multistage pipelined operation according to a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of pipelined arithmetic circuits constitutes a multistage arithmetic pipeline including a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor, and the descriptor is used for determining a storage address of data corresponding to the operand; and
in response to receiving a plurality of arithmetic instructions, at least one stage of arithmetic circuitry in the multi-stage arithmetic pipeline is configured to execute a corresponding one of the plurality of arithmetic instructions based on the storage address.
Clause 21, the method of clause 20, wherein the opcode of the computation instruction represents a plurality of operations performed by the multi-stage operation pipeline, the computing device further comprising control circuitry, the method comprising configuring the control circuitry to fetch and parse the computation instruction to obtain the plurality of operation instructions corresponding to the plurality of operations, and in the parsing, the method further comprising configuring the control circuitry to determine a storage address for data corresponding to the operand from the descriptor.
Clause 22, the method of clause 21, wherein the computing instruction comprises an identification of a descriptor and/or a content of a descriptor comprising at least one shape parameter representing a shape of the tensor data.
Clause 23, the method of clause 22, wherein the contents of the descriptor further comprise at least one address parameter representing an address of tensor data.
Clause 24, the method of clause 23, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
Clause 25, the method of clause 24, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
Clause 26, the method of clause 21, wherein the opcode and the plurality of operations it represents are predetermined in accordance with functions supported by a plurality of arithmetic circuits arranged stage-by-stage in a multi-stage arithmetic pipeline.
Clause 27, the method of clause 20, wherein the operational circuits of the stages in the multi-stage operational pipeline are configured to be selectively connected in accordance with the plurality of operational instructions in order to execute the plurality of operational instructions.
Clause 28, the method of clause 20, wherein the plurality of sets of pipelined arithmetic circuits form a plurality of multistage arithmetic pipelines, and the plurality of multistage arithmetic pipelines execute respective pluralities of arithmetic instructions in parallel.
Clause 29, the method of clause 20 or 28, wherein each stage of operational circuitry in the multi-stage operational pipeline has an input and an output for receiving input data at the stage of operational circuitry and outputting results of the operation of the stage of operational circuitry.
Clause 30, the method of clause 29, wherein within a multi-stage arithmetic pipeline, the output of the arithmetic circuitry of one or more stages is configured to be connected to the input of the arithmetic circuitry of another stage or stages in accordance with the arithmetic instruction to execute the arithmetic instruction.
Clause 31, the method of clause 29, wherein the plurality of multi-stage operation pipelines includes a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein the method configures an output of an operational circuit of one or more stages of the first multi-stage operation pipeline to be connected to an input of an operational circuit of one or more stages of the second multi-stage operation pipeline according to the operational instruction.
Clause 32, the method of clause 20, wherein each stage of arithmetic circuitry comprises one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a table look-up circuit, a parameter configuration circuit, a multiplier, a pooling device, a comparator, an absolute value calculating circuit, a logic arithmetic device, a position index circuit or a filter.
Clause 33, the method of clause 20, wherein the computing device further comprises data processing circuitry including type conversion circuitry for performing a data type conversion operation and/or data splicing circuitry for performing a data splicing operation.
Clause 34, the method of clause 33, wherein the type conversion circuit comprises one or more converters for effecting conversion of the calculation data between the plurality of different data types.
Clause 35 the method of clause 33, wherein the data splicing circuit is configured to split the calculated data in a predetermined bit length and splice a plurality of data blocks obtained after the splitting in a predetermined order.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (35)

1. A computing device, comprising:
one or more sets of pipelined arithmetic circuits configured to perform a plurality of stages of pipelined arithmetic operations based on a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of the pipelined arithmetic circuits constitutes a multistage arithmetic pipeline and the multistage arithmetic pipeline includes a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor and the descriptor is used to determine a storage address of data corresponding to the operand,
wherein in response to receiving the plurality of operation instructions, at least one stage of operation circuitry in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions based on the storage address.
2. The computing device of claim 1, wherein an opcode of the computing instruction represents a plurality of operations performed by the multi-stage operation pipeline, the computing device further comprising control circuitry configured to fetch and parse the computing instruction to obtain the plurality of operational instructions corresponding to the plurality of operations, and in the parsing, the control circuitry is further configured to determine a storage address for data corresponding to the operand from the descriptor.
3. The computing device of claim 2, wherein the computing instructions comprise an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
4. The computing device of claim 3, wherein contents of the descriptor further include at least one address parameter representing an address of tensor data.
5. The computing device of claim 4, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
6. The computing device of claim 5, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
7. The computing device of claim 2, wherein the opcode and the plurality of operations it represents are predetermined according to functions supported by a plurality of arithmetic circuits arranged stage-by-stage in a multi-stage arithmetic pipeline.
8. The computing device of claim 1, wherein each stage of operational circuitry in the multi-stage operational pipeline is configured to be selectively connected in accordance with the plurality of operational instructions in order to execute the plurality of operational instructions.
9. The computing device of claim 1, wherein the plurality of sets of pipelined arithmetic circuits form a plurality of multistage arithmetic pipelines, and the plurality of multistage arithmetic pipelines execute respective pluralities of arithmetic instructions in parallel.
10. The computing device of claim 1 or 9, wherein each stage of operational circuitry in the multi-stage operational pipeline has an input and an output for receiving input data at the stage of operational circuitry and outputting results of the stage of operational circuitry operation.
11. The computing device of claim 10, wherein within a multi-stage arithmetic pipeline, the output of the arithmetic circuitry of one or more stages is configured to be connected to the input of the arithmetic circuitry of another stage or stages in accordance with an arithmetic instruction to execute the arithmetic instruction.
12. The computing device of claim 10, wherein the plurality of multi-stage operation pipelines comprises a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein an output of an operational circuit of one or more stages of the first multi-stage operation pipeline is configured to be connected to an input of an operational circuit of one or more stages of the second multi-stage operation pipeline according to the operational instruction.
13. The computing device of claim 1, wherein each stage of arithmetic circuitry comprises one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a table look-up circuit, a parameter configuration circuit, a multiplier, a pooling device, a comparator, an absolute value calculating circuit, a logic arithmetic device, a position index circuit or a filter.
14. The computing device of claim 1, further comprising data processing circuitry comprising type conversion circuitry for performing data type conversion operations and/or data splicing circuitry for performing data splicing operations.
15. The computing device of claim 14, wherein the type conversion circuitry comprises one or more converters for enabling conversion of computing data between a plurality of different data types.
16. The computing device of claim 14, wherein the data splicing circuit is configured to split the computing data in a predetermined bit length and splice a plurality of data blocks obtained after splitting in a predetermined order.
17. An integrated circuit chip comprising the computing device of any of claims 1-16.
18. A board card comprising the integrated circuit chip of claim 17.
19. An electronic device comprising the integrated circuit chip of claim 17.
20. A method of performing a computing operation using a computing device, wherein the computing device includes one or more sets of pipelined arithmetic circuits, the method comprising:
configuring each of the one or more sets of pipelined arithmetic circuits to perform a multistage pipelined operation according to a plurality of arithmetic instructions obtained by parsing a computation instruction, wherein each set of pipelined arithmetic circuits constitutes a multistage arithmetic pipeline including a plurality of arithmetic circuits arranged in stages, wherein an operand of the computation instruction includes a descriptor indicating a shape of a tensor, and the descriptor is used for determining a storage address of data corresponding to the operand; and
in response to receiving a plurality of arithmetic instructions, at least one stage of arithmetic circuitry in the multi-stage arithmetic pipeline is configured to execute a corresponding one of the plurality of arithmetic instructions based on the storage address.
21. The method of claim 20, wherein the opcode of the computation instruction represents a plurality of operations performed by the multi-stage operation pipeline, the computing device further comprising control circuitry, the method comprising configuring the control circuitry to fetch and parse the computation instruction to obtain the plurality of operation instructions corresponding to the plurality of operations, and, in the parsing, the method further comprising configuring the control circuitry to determine from the descriptor a storage address of the data corresponding to the operand.
22. The method of claim 21, wherein the computation instruction comprises an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
23. The method of claim 22, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
24. The method of claim 23, wherein address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
25. The method of claim 24, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
26. The method of claim 21, wherein the opcode and the plurality of operations it represents are predetermined according to functions supported by a plurality of arithmetic circuits arranged stage-by-stage in a multi-stage arithmetic pipeline.
27. The method of claim 20, wherein each stage of operational circuitry in the multi-stage operational pipeline is configured to be selectively connected in accordance with the plurality of operational instructions in order to execute the plurality of operational instructions.
28. The method of claim 20, wherein the plurality of sets of pipelined arithmetic circuits form a plurality of multistage arithmetic pipelines, and the plurality of multistage arithmetic pipelines execute respective pluralities of arithmetic instructions in parallel.
29. A method as claimed in claim 20 or 28, wherein each stage of operational circuitry in the multistage operational pipeline has an input and an output for receiving input data at the stage of operational circuitry and outputting the result of the operation of the stage of operational circuitry.
30. The method of claim 29, wherein within a multistage arithmetic pipeline, the outputs of the arithmetic circuits of one or more stages are configured to be connected to the inputs of the arithmetic circuits of another stage or stages in accordance with an arithmetic instruction to execute the arithmetic instruction.
31. The method of claim 29, wherein the plurality of multi-stage operation pipelines comprises a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein the method configures an output of an operational circuit of one or more stages of the first multi-stage operation pipeline to be connected to an input of an operational circuit of one or more stages of the second multi-stage operation pipeline in accordance with the operational instruction.
32. The method of claim 20, wherein each stage of operational circuitry comprises one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a table look-up circuit, a parameter configuration circuit, a multiplier, a pooling device, a comparator, an absolute value calculating circuit, a logic arithmetic device, a position index circuit or a filter.
33. The method of claim 20, wherein the computing device further comprises data processing circuitry comprising type conversion circuitry for performing data type conversion operations and/or data splicing circuitry for performing data splicing operations.
34. The method of claim 33, wherein the type conversion circuitry comprises one or more converters for effecting conversion of the computational data between a plurality of different data types.
35. The method of claim 33, wherein the data splicing circuit is configured to split the computed data in a predetermined bit length and splice a plurality of data blocks obtained after the splitting in a predetermined order.