CN114489788A - Instruction processing device, instruction processing method and related product


Info

Publication number
CN114489788A
Authority
CN
China
Prior art keywords
instruction
descriptor
instructions
dependency
tensor data
Prior art date
Legal status
Pending
Application number
CN202011270373.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202011270373.2A
Publication of CN114489788A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions

Abstract

The disclosure provides an instruction processing device, an instruction processing method, and a related product. The instruction processing apparatus may be implemented in a computing apparatus included in a combined processing apparatus, which may also include an interface apparatus and other processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete computing operations specified by a user. The combined processing apparatus may further comprise a storage apparatus connected to the computing apparatus and the other processing apparatus, respectively, for storing data of the computing apparatus and the other processing apparatus. Aspects of the present disclosure provide an instruction system involving tensor data that may increase processing flexibility and improve the processing efficiency of a machine.

Description

Instruction processing device, instruction processing method and related product
Technical Field
The present disclosure relates to the field of processors, and in particular, to an instruction processing apparatus, an instruction processing method, a chip, and a board.
Background
The instruction system is the interface between computer software and hardware and a very important part of the computer system architecture. With the continuous development of artificial intelligence technology, both the amount of data to be processed and the dimensionality of that data keep increasing. Therefore, how to design instructions reasonably and scientifically, so that they not only provide enough information but also save storage space, shorten instruction fetch time, and improve machine performance, is an important problem in instruction design.
Disclosure of Invention
To address one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, an instruction system involving tensor data. With the instruction system of the present disclosure, processing flexibility can be increased, thereby improving the processing efficiency of the machine.
In a first aspect, the present disclosure provides an instruction processing apparatus comprising: a decoding unit configured to decode an instruction to obtain an operation code and an operand of the instruction; a dependency check unit configured to determine whether a dependency exists between two decoded instructions; and an instruction issue unit configured to determine, according to the types of the operation codes of the two instructions and the dependency, whether the two instructions can be executed out of order.
In a second aspect, the present disclosure provides a chip comprising the instruction processing apparatus of any of the foregoing first aspect embodiments.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides an instruction processing method comprising: decoding an instruction to obtain an operation code and an operand of the instruction; determining whether a dependency exists between two decoded instructions; and determining, according to the types of the operation codes of the two instructions and the dependency, whether the two instructions can be executed out of order.
With the instruction processing device, instruction processing method, chip, and board card of the embodiments of the present disclosure, both the opcode types of instructions and the dependencies between instructions are considered when judging whether instructions can be executed out of order. This relaxes the conditions for out-of-order execution, increases processing flexibility, and improves the processing efficiency of the machine.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of an instruction processing apparatus according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an instruction stream;
FIG. 5 shows a schematic flow chart diagram of an instruction processing method according to an embodiment of the present disclosure;
FIGS. 6A and 6B illustrate an exemplary flow chart for determining whether instructions can be executed out of order according to an embodiment of the disclosure;
FIG. 7 shows a block diagram of a combined processing device according to an embodiment of the disclosure; and
fig. 8 shows a schematic structural diagram of a board card according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. as may be used in the claims, the specification, and the drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Computers process various data by executing instructions. To indicate the source of data, the destination of the operation results, and the operation performed, an instruction typically contains the following information:
(1) The operation code (OP) indicates the operation to be performed by the instruction (e.g., add, subtract, multiply, divide, data transfer) and specifies the nature and function of the operation. A computer may have tens to hundreds of instructions, each with a corresponding opcode; by recognizing the opcode, the computer performs the corresponding operation.
(2) The operand describes the operation object of the instruction. Operands may involve the data type, memory access address, addressing mode, etc. of the operated-on object. An operand may directly give the operated-on object, or indicate a memory address or register address (i.e., a register name) of the operated-on object.
The instructions of conventional processors are designed to perform basic single-data scalar operations, where each operand of the instruction is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition the operands involved are often multidimensional data types (i.e., tensor data), and scalar operations alone do not allow hardware to complete these operation tasks efficiently. Therefore, how to execute multidimensional tensor data processing efficiently is also an urgent problem in the current computing field.
In the instruction system provided by implementations of the present disclosure, descriptors are employed to indicate information of tensor data. Various possible implementations of descriptors for indicating tensor data information are described in detail below in conjunction with the figures.
Tensors can take many forms of data organization. Tensors may have different numbers of dimensions: a scalar may be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a 2-dimensional tensor, while tensors of more than 2 dimensions are also possible. The shape of a tensor includes information such as the number of dimensions and the size of each dimension. For example, for the three-dimensional tensor:
x3 = [[[1,2,3],[4,5,6]], [[7,8,9],[10,11,12]]]
the shape of the tensor x3 can be expressed as (2, 2, 3); that is, three parameters express it as a three-dimensional tensor whose first dimension has size 2, second dimension has size 2, and third dimension has size 3. When tensor data is stored in memory, its shape cannot be determined from the data address (or storage region) alone, and related information such as the relationships among multiple pieces of tensor data cannot be determined either, which results in low efficiency when the processor accesses the tensor data.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, N being a positive integer such as N = 1, 2, or 3 (or zero, for a scalar). The three-dimensional tensor in the above example can be represented by the descriptor (2, 2, 3). It should be noted that the present disclosure does not limit the way descriptors indicate the tensor shape.
In one possible implementation, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
Although tensor data can be multidimensional, there is a correspondence between a tensor and its storage in memory, because the layout of memory is always one-dimensional. Tensor data is typically allocated in contiguous memory space; that is, the tensor data can be expanded one-dimensionally (e.g., row-first) and stored in memory. The relationship between the tensor and the underlying storage may be represented by the offset of a dimension (offset), the size of a dimension (size), the step size of a dimension (stride), and so on. The offset of a dimension is the offset from a reference position in that dimension. The size of a dimension is the number of elements in that dimension. The step size of a dimension is the interval between adjacent elements in that dimension; for example, the step sizes of the three-dimensional tensor above are (6, 3, 1): the step size of the first dimension is 6, that of the second dimension is 3, and that of the third dimension is 1.
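As a concrete illustration of these layout parameters, the following sketch (hypothetical helper functions, not part of this disclosure) computes the row-first strides of the example tensor of shape (2, 2, 3) and maps a multidimensional index to its one-dimensional storage offset:

    def row_major_strides(shape):
        """Step size (stride) of each dimension in a row-first layout:
        the last dimension has stride 1."""
        strides = [1] * len(shape)
        for i in range(len(shape) - 2, -1, -1):
            strides[i] = strides[i + 1] * shape[i + 1]
        return tuple(strides)

    def linear_offset(index, strides):
        """Map a multidimensional index to a one-dimensional storage offset."""
        return sum(i * s for i, s in zip(index, strides))

    assert row_major_strides((2, 2, 3)) == (6, 3, 1)  # matches the example above
    assert linear_offset((1, 0, 2), (6, 3, 1)) == 8   # element x3[1][0][2], i.e. the value 9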
FIG. 1 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1, the data storage space 21 stores two-dimensional data in a row-first manner, addressable by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), the size in the Y-axis direction (the total number of rows) is ori_y (not shown), and the starting address PA_start (base address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is part of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the data block 23 is defined using a descriptor, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 may then be determined from the size ori_x of the data storage space 21 in the X axis, its size ori_y in the Y axis, together with the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following formula (1):
{ori_x, ori_y, offset_x, offset_y, size_x, size_y} (1)
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation, a reference address of the data reference point of the descriptor in the data storage space may be agreed upon, and based on that reference address, the content of the descriptor of the tensor data is determined from the positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions.
For example, the reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. One datum (for example, the datum at position (2, 2)) may be selected in the data storage space 21 as the data reference point, and its physical address in the data storage space used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1 can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example the top-left and bottom-right vertices, where the relative position of the top-left vertex is (x_min, y_min) and that of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following formula (2):
{PA_base, (x_min, y_min), (x_max, y_max)} (2)
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor. For example, when the tensor data indicated by the descriptor is three-dimensional spatial data, the mapping relationship between the data description position and the data address may be defined using a function f(x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (3):
{f(x, y, z)} (3)
in one possible implementation, the descriptor is further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (4):
{PA, ori_x, ori_y, offset_x, offset_y, size_x, size_y} (4)
where PA is the address parameter. The address parameter may be a logical address or a physical address. When the descriptor is parsed, PA is taken as any one of a vertex, the middle point, or a preset point of the tensor shape, and the corresponding data address is obtained by combining it with the shape parameters in the X direction and the Y direction.
In one possible implementation, the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address comprises a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be the following equation (5):
{PA_start, ori_x, ori_y, offset_x, offset_y, size_x, size_y} (5)
where PA_start is the reference address parameter, as described above.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In a possible implementation, a default reference address may be set for a task, to be used by the descriptors in instructions under that task, and the descriptor content may include shape parameters based on this reference address. This reference address may be determined by setting environment parameters for the task. The relevant description and usage of the reference address can be found in the above embodiments. In this implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the mode of setting a common reference address by using the environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.
For example, suppose the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x × size_y. Then the starting data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space may be determined using the following formula (6):
PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x (6)
Combining the data start address PA1(x,y) determined by formula (6) with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage area, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.
For example, suppose the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x × size_y, and the data description position for the descriptor included in the operand is (xq, yq). Then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space may be determined using the following formula (7):
PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq) (7)
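Formulas (6) and (7) transcribe directly into code. The following sketch mirrors them one-to-one (variable names taken from the text; the function names are ours):

    def pa1_start_address(pa_start, ori_x, offset_x, offset_y):
        """Formula (6): starting data address of the tensor data
        indicated by the descriptor in the data storage space."""
        return pa_start + (offset_y - 1) * ori_x + offset_x

    def pa2_data_address(pa_start, ori_x, offset_x, offset_y, xq, yq):
        """Formula (7): address of the element at data description
        position (xq, yq) within the described data block."""
        return pa_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)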
in one possible implementation, the descriptor may indicate the data of the block. The data partitioning can effectively accelerate the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data partitioning for fast arithmetic processing.
FIG. 2 shows a schematic diagram of data chunking in a data storage space according to an embodiment of the present disclosure. As shown in FIG. 2, the data storage space 200 also stores two-dimensional data in a row-first manner, which may be represented by (X, Y) (where the X-axis is horizontally to the right and the Y-axis is vertically down). The dimension in the X-axis direction (the dimension of each row, or the total number of columns) is ori _ X (not shown), and the dimension in the Y-axis direction (the total number of rows) is ori _ Y (not shown). Unlike the tensor data of fig. 1, the tensor data stored in fig. 2 includes a plurality of data blocks.
In this case, the descriptor requires more parameters to represent the data blocks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori_x; x.tile.size (the size 202 within a tile); x.tile.stride (the step size 204 within a tile, i.e., the distance between the first point of the first tile and the first point of the second tile); x.tile.num (the number of tiles, shown as 3 in FIG. 2); and x.stride (the overall step size, i.e., the distance from the first point of one row to the first point of the next row). Other dimensions may similarly have corresponding parameters.
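As a rough illustration of how the tiling parameters compose (the address arithmetic below is an assumption made for the sketch; the text does not spell it out), the X-direction offset of element i inside tile t might be computed as:

    def tiled_x_offset(t, i, x_tile_stride):
        """Offset in the X direction of element i of tile t, assuming tiles
        are laid out x.tile.stride apart (cf. FIG. 2)."""
        return t * x_tile_stride + i

    # Three tiles of size 2 with stride 4: element offsets 0-1, 4-5, 8-9.
    offsets = [tiled_x_offset(t, i, 4) for t in range(3) for i in range(2)]
    assert offsets == [0, 1, 4, 5, 8, 9]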
In one possible implementation, the descriptor may also indicate compression information of the associated tensor data. For example, the descriptor may include a compression flag, such as compress_en, marking whether the associated tensor data is compressed. Alternatively or additionally, the descriptor may indicate the compression or encoding scheme employed, for example by recording the compression mode compress_base, and/or the compression parameters of the scheme used.
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier is used to distinguish descriptors; for example, it may be the descriptor's number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier and content of descriptors and the tensor data indicated by the descriptors may be stored in the same block of internal memory. For example, a contiguous block of on-chip cache with addresses ADDR0-ADDR1023 may be used to store the related content of descriptors. Within it, addresses ADDR0-ADDR63 may serve as the descriptor storage space, storing the identifiers and content of descriptors, while addresses ADDR64-ADDR1023 serve as the data storage space, storing the tensor data indicated by the descriptors. In the descriptor storage space, the identifiers of the descriptors may be stored at addresses ADDR0-ADDR31 and their contents at addresses ADDR32-ADDR63. It should be understood that ADDR here does not denote 1 bit or 1 byte; it denotes one address, i.e., one address unit. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited thereto.
In one possible implementation, the identity of the descriptors, the content, and the tensor data indicated by the descriptors may be stored in different areas of internal memory. For example, a register may be used as a descriptor storage space, the identifier and the content of the descriptor may be stored in the register, an on-chip cache may be used as a data storage space, and tensor data indicated by the descriptor may be stored.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area in the buffer space can be allocated for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identifier and content of descriptors may be stored in an internal memory, and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identifier and content of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored off-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, separate data storage spaces may be divided for tensor data, each of which has a one-to-one correspondence with descriptors at the start address of the data storage space. In this case, a circuit or module (e.g., an entity external to the disclosed computing device) responsible for parsing the computation instruction may determine the data address in the data storage space of the data corresponding to the operand from the descriptor.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may further be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter indicating that address. For example, if the tensor data is 3-dimensional, when the descriptor points to the address of the tensor data, its content may include one address parameter indicating the address, such as the starting physical address of the tensor data, or a plurality of address parameters, such as the starting address of the tensor data plus an address offset, or address parameters of the tensor data in each dimension. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a variation of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the base address may comprise a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one direction of the N dimensional directions, the size of the storage area in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
In some embodiments of the present disclosure, management of the creation/registration, modification, and deregistration/release of descriptors may be implemented by descriptor management instructions, with corresponding opcodes set for these management instructions. A descriptor may be created (registered), for example, by a descriptor creation instruction (TRCreate); the parameters of a descriptor (shape, address, etc.) may be modified by a descriptor modification instruction; and a descriptor may be released (deregistered) by a descriptor release instruction (TRRelease). The present disclosure is not limited to these types of descriptor management instructions or to specific opcode settings. In the following description, "descriptor creation instruction" and "descriptor registration instruction" are used interchangeably, and "descriptor release instruction" and "descriptor deregistration instruction" are used interchangeably.
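A minimal sketch of these management opcodes as an enumeration (the mnemonics TRCreate/TRRelease appear in the text; the numeric encodings below are placeholders, not the patented encoding):

    from enum import Enum

    class DescriptorOp(Enum):
        """Descriptor management opcodes named in the text; the numeric
        values are illustrative placeholders."""
        TR_CREATE = 0x01   # create / register a descriptor
        TR_MODIFY = 0x02   # modify a descriptor's parameters (shape, address, ...)
        TR_RELEASE = 0x03  # release / deregister a descriptor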
The operation code of a descriptor management instruction indicates the specific action (e.g., create, modify, release), and its operand carries the management parameters of the descriptor. The management parameters indicate the operating parameters of the descriptor management instruction. The present disclosure does not limit the details of the management parameters.
For the descriptor creation instruction, in one possible implementation, the management parameter thereof may include at least one of an identification of the descriptor, a shape of tensor data indicated by the descriptor, and a content of the tensor data indicated by the descriptor. For example, the management parameters of the descriptor creation instruction may include the descriptor identification TR0, the shape of tensor data indicated by the descriptor (number of dimensions, size of each dimension, offset, start data address, and the like). Upon execution of the descriptor creation instruction, the descriptor may be created according to at least one of the management parameters.
In one implementation, for example, when the management parameters of the descriptor creation instruction include an identifier TR0 of the descriptor, a descriptor TR0 may be created according to that identifier and stored in the descriptor storage space (e.g., a register) corresponding to TR0.
Alternatively or additionally, when the management parameter of the descriptor creation instruction includes the shape of the tensor data indicated by the descriptor, the content of the descriptor may be determined according to the shape of the tensor data indicated by the descriptor, and the content of the descriptor may be stored in the descriptor storage space, thereby completing creation or registration of the descriptor.
Alternatively or additionally, when the management parameter of the descriptor creation instruction further includes the identifier of the descriptor, after determining the content of the descriptor, the content of the descriptor may be stored in a descriptor storage space corresponding to the identifier of the descriptor, and the creation of the descriptor is completed. If the descriptor has no corresponding descriptor storage space, the content of the descriptor can be stored in the descriptor storage space, and the corresponding relation between the descriptor and the descriptor storage space is established, so that the creation of the descriptor is completed.
Alternatively or additionally, when the management parameter of the descriptor creation instruction includes the content of the tensor data indicated by the descriptor, the content of the descriptor may be determined according to the content of the tensor data indicated by the descriptor, the correspondence between the content of the tensor data and the content of the descriptor may be established, and the content of the descriptor may be stored in the descriptor storage space, so as to complete the creation of the descriptor. When the management parameter further includes the identifier of the descriptor, after the content of the descriptor is determined, the content of the descriptor may be stored in a descriptor storage space corresponding to the identifier of the descriptor, and the creation of the descriptor is completed. If the identifier of the descriptor does not have a corresponding fixed descriptor storage space, the content of the descriptor can be stored in the descriptor storage space, and the corresponding relation between the identifier of the descriptor and the descriptor storage space is established, so that the creation of the descriptor is completed.
In one possible implementation, the descriptor may also be created from the shape of the tensor data indicated by the descriptor and the content of the tensor data indicated by the descriptor, or from the identification of the descriptor, the shape of the tensor data indicated by the descriptor and the content of the tensor data indicated by the descriptor. The present disclosure does not limit the combination and specific values of the management parameters in the descriptor creation instruction.
In one possible implementation, the descriptor creating instruction may include management parameters of a plurality of descriptors, for example, identifiers TR0, TR1, and TR2 of the descriptors in the instruction, and then the descriptors TR0, TR1, and TR2 may be created according to the management parameters (at least one of the identifier of the descriptor, the shape of tensor data indicated by the descriptor, and the content of the tensor data indicated by the descriptor). The creation process for each TR is the same as or similar to the creation process described above. Therefore, a plurality of descriptors can be created in batch according to one instruction, and the creation efficiency of the descriptors is further improved.
In the embodiments of the present disclosure, the descriptor can be created according to at least one of the identification of the descriptor, the shape of the tensor data indicated by the descriptor, and the content of the tensor data indicated by the descriptor, so that the creation of the descriptor can satisfy various operation and/or use needs, and thus the processing efficiency of the descriptor can be improved.
In one possible implementation, creating the descriptor according to at least one of an identification of the descriptor, a shape of the tensor data indicated by the descriptor, and a content of the tensor data indicated by the descriptor may include: determining a first storage area of the content of the descriptor in the descriptor storage space and a second storage area of the content of the tensor data indicated by the descriptor in the data storage space; determining the content of the descriptor according to at least one of the identifier of the descriptor, the shape of the tensor data indicated by the descriptor and the content of the tensor data indicated by the descriptor, and establishing a corresponding relation between the descriptor and the second storage area; storing the contents of the descriptor in the first storage area.
The first storage area and/or the second storage area may be determined directly if at least one of them has been set in advance. For example, if the content of the descriptor and the content of the tensor data are stored in the same memory space, with the memory addresses of the descriptor content corresponding to identifier TR0 being ADDR32-ADDR63 and the memory addresses of the tensor data content being ADDR64-ADDR1023, the two address ranges can be directly determined as the first storage area and the second storage area.
If there is no preset memory area, a first memory area may be allocated in the descriptor memory space for the descriptor content and a second memory area may be allocated in the data memory space for the tensor data content.
Taking the data block 23 shown in FIG. 1 as an example, the registration parameters may include the start address PA_start (reference address) of the data storage space 21, the offset 25 (offset_x) in the X-axis direction, the offset 24 (offset_y) in the Y-axis direction, the size (size_x) in the X-axis direction, and the size (size_y) in the Y-axis direction. With these parameters, the content of the descriptor can be expressed as formula (1) and stored in the first storage area, completing the creation of the descriptor.
In this way, the descriptor can be automatically created according to the descriptor creation instruction, and the correspondence between the tensor indicated by the descriptor and the data address is realized, so that the data address is obtained through the content of the descriptor during data processing, and the data access efficiency of the processor is improved.
In one possible implementation, the content of the tensor data indicated by the descriptor includes at least one of an immediate and data in a register. The immediate may be tensor data that does not change during data processing. After the corresponding relationship between the descriptor and the immediate is established, the descriptor can be used to replace the immediate in the data processing process. The content of the tensor data indicated by the descriptor can also comprise data in a register, and after the corresponding relation between the descriptor and the data in the register is established, the number of the register can be used as the identifier of the descriptor. In the present embodiment, by indicating the immediate and the data in the register through the descriptor, the complexity of using the immediate and the data in the register can be reduced, thereby improving the efficiency of data processing.
For the descriptor release instruction, in a possible implementation, the management parameter may include an identifier of the descriptor, which indicates the descriptor to be released/logged out.
In a possible implementation manner, the management parameters of the descriptor release instruction may include an identifier of at least one descriptor, that is, one descriptor may be released or multiple descriptors may be released simultaneously by one descriptor release instruction.
In a possible implementation, the descriptor release instruction may include the identifiers of only some of the current descriptors, releasing just those descriptors, or the identifiers of all current descriptors, releasing all of them.
In a possible implementation manner, releasing a descriptor corresponding to an identifier of a descriptor according to the identifier may include: and respectively releasing the storage area of the descriptor in the descriptor storage space and the storage area of the content of the tensor data indicated by the descriptor in the data storage space. By the method, the space occupied by the descriptor can be released after the use of the descriptor is finished, the limited storage resource can be repeatedly utilized, and the utilization efficiency of the resource is improved.
In a possible implementation manner, releasing a descriptor corresponding to an identifier of a descriptor according to the identifier may include: storing the content of the descriptor stored in the descriptor storage space to a specified storage space; and logging out the descriptor corresponding to the identifier. That is, in this implementation, the release operation may be performed after saving the contents of the descriptor to be released. By saving the descriptor content and then releasing the descriptor, the resources (such as descriptor identifiers, memory spaces and other resources) occupied by the current descriptor can be released while saving the descriptor content which needs to be used subsequently, thereby improving the resource utilization efficiency.
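As a rough illustration of the save-then-release variant (the dictionary-based storage spaces below are purely an assumption for the sketch):

    def release_descriptor(desc_id, descriptor_space, saved_space=None):
        """Sketch of descriptor release: optionally save the descriptor
        content to a specified storage space before freeing it."""
        content = descriptor_space.pop(desc_id)  # free the descriptor storage area
        if saved_space is not None:
            saved_space[desc_id] = content       # keep content needed later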
For the descriptor modification instruction, in one possible implementation, the management parameter thereof may include at least one of an identification of the descriptor, a content of the descriptor to be modified, and a content of tensor data indicated by the descriptor. When the descriptor modifying instruction is executed, determining the content to be updated of the descriptor according to the management parameter of the descriptor; and updating at least one of the identifier of the descriptor, the content of the descriptor in the descriptor storage space and the content of the tensor data in the data storage space according to the content to be updated. The descriptor modification instruction may be used to modify various parameters of the descriptor, such as the identification of the descriptor, the tensor shape, and so on.
In one possible implementation, when the processing instruction is a descriptor modification instruction, the content to be updated of the descriptor may be determined according to management parameters of the descriptor, for example, changing the dimension of the tensor from 3 dimensions to 2 dimensions, changing the size of the tensor in one or more dimension directions, and the like.
In this way, when the tensor data indicated by the descriptor is changed, the descriptor can be directly modified to maintain the correspondence between the descriptor and the tensor data, and the utilization efficiency of resources is improved.
The above has described the introduction of descriptors and the corresponding descriptor management instructions into an instruction system for tensor data processing. To improve instruction processing efficiency, an ordered instruction stream is typically executed out of order. Out-of-order execution can improve pipeline efficiency by mining potential overlap or independence among instructions. Therefore, how it is determined whether instructions can be executed out of order directly affects the processing efficiency of the instruction system.
Embodiments of the present disclosure provide an instruction processing scheme in which two factors, the opcode types of the instructions and the dependency between the instructions, are considered together when judging whether instructions can be executed out of order. Potential parallelism between instructions is thereby mined as far as possible, so the conditions for out-of-order execution can be relaxed, processing flexibility increased, and the processing efficiency of the machine improved.
FIG. 3 shows a schematic block diagram of an instruction processing apparatus 300 according to an embodiment of the present disclosure. As shown in fig. 3, the instruction processing apparatus 300 includes a decoding unit 310, a dependency check unit 320, and an instruction transmitting unit 330.
Decode unit 310 may be configured to decode an instruction to obtain an opcode and operands for the instruction. The dependency check unit 320 may be configured to determine whether a dependency exists between two decoded instructions. The instruction issue unit 330 may be configured to determine whether two instructions are capable of out-of-order execution based on the type of opcodes of the two instructions and the dependencies determined by the dependency check unit 320. By considering two factors of the type and the dependency relationship of the operation codes of the instructions, the constraint condition of out-of-order execution can be further refined, so that the parallelism among the instructions is fully utilized, and the processing efficiency is improved.
In some embodiments, the instruction issue unit 330 may be further configured to: determine that two instructions can be executed out of order when both are descriptor management instructions of a specified type, where a descriptor is used to indicate information of tensor data. Alternatively or additionally, the instruction issue unit 330 may be further configured to: determine that two instructions need to be executed in order when their opcode types differ and a dependency exists between them. Alternatively or additionally, the instruction issue unit 330 may be further configured to: determine that two instructions can be executed out of order when no dependency exists between them.
As can be seen from the foregoing description of descriptor management instructions, some types of descriptor management instructions retain potential parallelism even when dependencies exist between the instructions. For example, a descriptor creation instruction registers a descriptor indicating tensor data, which involves creating the descriptor. If both instructions are descriptor creation instructions, then no matter which executes first, execution of the other is not affected. Likewise, a descriptor release instruction releases a descriptor indicating tensor data, which involves destroying the descriptor. If both instructions are descriptor release instructions, then no matter which executes first, execution of the other is not affected.
Thus, by carefully analyzing the specific operations involved in the various descriptor management instructions, possible instruction-parallelism scenarios can be uncovered. In some implementations, the descriptor management instructions of the specified type include any of: a descriptor creation instruction and a descriptor release instruction.
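Putting the three rules together, the issue decision can be sketched as follows (schematic logic only, not the actual hardware):

    REORDERABLE_TYPES = {"TRCreate", "TRRelease"}

    def can_execute_out_of_order(type1, type2, has_dependency):
        """Decision rule of the instruction issue unit: same-type create or
        release descriptor-management instructions may be reordered even
        when their operands are dependent."""
        if type1 == type2 and type1 in REORDERABLE_TYPES:
            return True   # both of the same specified management type
        if not has_dependency:
            return True   # independent instructions
        return False      # different types plus a dependency: keep order

    # Two creation instructions on dependent descriptors may be reordered,
    # while a release followed by a dependent creation may not (cf. FIG. 4 below).
    assert can_execute_out_of_order("TRCreate", "TRCreate", True)
    assert not can_execute_out_of_order("TRRelease", "TRCreate", True)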
FIG. 4 shows a schematic diagram of an instruction stream, to better understand the decision conditions for out-of-order execution of instructions in embodiments of the present disclosure. The instruction stream 410 shown on the left side of FIG. 4 contains, in original order, a creation instruction 411 for descriptor A, a release instruction 412 for descriptor A, a creation instruction 413 for descriptor B, and a release instruction 414 for descriptor B. The instruction stream 420 shown on the right side of FIG. 4 contains, in original order, a creation instruction 421 for descriptor A, a creation instruction 422 for descriptor B, a release instruction 423 for descriptor B, and a release instruction 424 for descriptor A. Those skilled in the art will appreciate that instruction streams 410 and 420 may also include other instructions, which are not shown for clarity and brevity. The example of FIG. 4 assumes that a dependency exists between descriptor A and descriptor B.
Based on the above-described determination condition, for the release instruction 412 of the descriptor a and the create instruction 413 of the descriptor B in the instruction stream 410, since the opcode types of the two instructions are different and there is a dependency relationship between the instructions, the instructions 412 and 413 cannot be executed out of order.
For the creation instruction 421 of the descriptor a and the creation instruction 422 of the descriptor B in the instruction stream 420, since the operation code types of the two instructions are the same (both are descriptor creation instructions), the instructions 421 and 422 can be executed out of order despite the dependency relationship between the two instructions. In addition, for the release instruction 423 for the descriptor B and the release instruction 424 for the descriptor a in the instruction stream 420, since the types of opcodes of the two instructions are the same (both are descriptor release instructions), the instructions 423 and 424 can be executed out of order despite the dependency relationship between the two instructions.
Returning to FIG. 3, after instruction issue unit 330 determines whether the instructions can be executed out-of-order, the corresponding processing may be performed on the instructions. For example, if an instruction is capable of out-of-order execution, instruction issue unit 330 may issue the instruction to a corresponding execution unit (not shown) to execute the instruction, or to a corresponding queue to wait for out-of-order execution. If the instructions cannot be executed out of order, the instruction issue unit may initiate a corresponding order preserving process to ensure the original order among the instructions. The order of execution of the instructions may be ensured in a number of ways. For example, synchronization instructions may be inserted between instructions to ensure ordering between instructions. The present disclosure is not limited to a particular manner of order preservation.
Optionally, the instruction processing apparatus 300 may further include a storage unit 340 configured to store the dependency table 341. Thus, the dependency check unit 320 may be configured to determine whether a dependency exists between two instructions by looking up a dependency table 341 pre-stored in the storage unit 340. Those skilled in the art will appreciate that the storage unit 340 may also be configured to store various other information including, but not limited to, instructions, descriptor-associated information, tensor data, and the like. The storage unit 340 may include various storage resources including, but not limited to, an internal memory and an external memory. The internal memory may include, for example, registers, on-chip SRAM, or other media cache. The external memory may comprise, for example, off-chip memory. The present disclosure is not limited to a particular implementation of the memory cell.
The dependency table 341 in the storage unit 340 may record dependencies between tensor data, so that the dependency check unit 320 may be configured to look up the dependency table 341 to determine dependencies between instructions based on tensor data involved in operands of the instructions.
In some implementations, the dependency table 341 may record the dependencies between tensor data in matrix form. Table 1 shows an exemplary dependency table in the form of a lookup matrix, where 1 represents no dependency and 0 represents a dependency (or vice versa):
             Tensor A   Tensor B   Tensor C
Tensor A        0          1          0
Tensor B        1          0          1
Tensor C        0          1          0

Table 1: Dependency table in the form of a lookup matrix
In the dependency information recorded in Table 1, Tensor A and Tensor B are independent, Tensor A and Tensor C are dependent, and Tensor B and Tensor C are independent. It should be noted that Table 1 contains redundant information (for example, the dependency between Tensor A and Tensor C is recorded twice), so the matrix can be reduced in practical use, for example to a triangular matrix. In addition, tensor data may be taken to depend on itself by default, so this information need not be stored, further reducing the matrix.
In other implementations, the dependency table 341 may record the dependent tensor data pairs in a configuration table. Table 2 shows an exemplary dependency table in the form of a configuration table in which only tensor data pairs for which dependencies exist are saved.
Tensor A Tensor B
Tensor A Tensor C
Tensor B Tensor C
Tensor C Tensor D
Table 2: dependency table in the form of a configuration table
Table 2 records the tensor data pairs that have dependencies: Tensor A and Tensor B, Tensor A and Tensor C, Tensor B and Tensor C, and Tensor C and Tensor D.
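By way of illustration only, the following sketch models such a configuration table as a bounded list of unordered tensor pairs; all names are hypothetical, and the fixed capacity anticipates the "no available space" case discussed below.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical configuration-table form of the dependency table (Table 2):
// only tensor pairs that actually have a dependency are stored.
class DependencyPairTable {
public:
    explicit DependencyPairTable(std::size_t capacity) : capacity_(capacity) {}

    // Returns false when the table is full and the pair cannot be recorded.
    bool record(int a, int b) {
        if (pairs_.size() >= capacity_) return false;
        pairs_.emplace_back(std::min(a, b), std::max(a, b));  // one entry per pair
        return true;
    }

    bool dependent(int a, int b) const {
        const std::pair<int, int> key{std::min(a, b), std::max(a, b)};
        return std::find(pairs_.begin(), pairs_.end(), key) != pairs_.end();
    }

    // Frees a recording slot, e.g. when the tensor data is released.
    void erase(int a, int b) {
        const std::pair<int, int> key{std::min(a, b), std::max(a, b)};
        pairs_.erase(std::remove(pairs_.begin(), pairs_.end(), key),
                     pairs_.end());
    }

private:
    std::size_t capacity_;
    std::vector<std::pair<int, int>> pairs_;
};
```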
In some embodiments, the dependency relationships between tensor data are recorded in the dependency relationship table described above at the same time as or after the tensor data are created.
Optionally, the instruction processing apparatus 300 may further include a processing unit 350 configured to record the dependencies between tensor data in the dependency table at the same time as, or after, the tensor data are created. In particular, the processing unit 350 may be configured to: determine the dependencies between tensor data; and record the dependencies in the dependency table.
The processing unit 350 may determine the dependency between the tensor data in a number of ways.
In some implementations, the dependency may be determined by software means. For example, whether a dependency exists between tensor data may be determined from indication information provided by the software side. The software-side indication information may be, for example, a flag bit included in the instruction that indicates whether a dependency exists between the tensor data.
In other implementations, the dependency may be determined by hardware means. For example, hardware may determine whether a dependency exists between tensor data by judging whether the address spaces corresponding to the tensor data overlap. In one example, the hardware may calculate the address space corresponding to each piece of tensor data according to its shape information (for example, the information provided by the descriptor described above) and determine whether a dependency exists by comparing the address spaces. In another example, the hardware can quickly determine whether a dependency exists by directly comparing whether the spatial identifiers of the tensor data are the same. Here, a spatial identifier denotes the spatial region used to store the corresponding tensor data, which may be a contiguous space or a multi-segment space; different spatial identifiers represent spatial regions that have no dependency on each other.
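By way of illustration only, the following sketch shows both hardware checks under the stated assumptions; the TensorRegion fields are hypothetical stand-ins for values that a real implementation would derive from the descriptor.

```cpp
#include <cstdint>

// Hypothetical summary of where a tensor lives in memory.
struct TensorRegion {
    std::uint64_t base;      // start address computed from the descriptor
    std::uint64_t size;      // extent in bytes computed from the shape information
    std::uint32_t space_id;  // identifier of the region storing the tensor
};

// Fast path: equal spatial identifiers imply a dependency; by construction,
// different identifiers point to regions with no dependency.
bool dependent_by_space_id(const TensorRegion& x, const TensorRegion& y) {
    return x.space_id == y.space_id;
}

// Precise path: compare the half-open address intervals [base, base + size).
bool dependent_by_overlap(const TensorRegion& x, const TensorRegion& y) {
    return x.base < y.base + y.size && y.base < x.base + x.size;
}
```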
The processing unit 350 may create or update the dependency table according to the result of the dependency determination. For example, the processing unit 350 may store a determined dependency in the dependency table.
In some embodiments, when the processing unit 350 determines that there is a dependency between the current tensor data and the previous tensor data but the dependency table has no available space, the processing unit 350 may selectively block operations on the current tensor data until the dependency table has available space.
For example, Table 2 above is assumed to have one recording slot still available, in which the processing unit 350 may record the determined dependency. If Table 2 had no recording space left, the processing unit 350 could block the operation on the current tensor data until recording space becomes available in the dependency table.
As can be seen from the instruction processing scheme described above, the conditions under which instructions may be executed out of order can be relaxed by additionally taking the opcode types of the instructions into account. Therefore, when the processing unit 350 would otherwise need to block an operation on the current tensor data, it may selectively block the operation in view of this constraint.
In some embodiments, the processing unit 350 may be configured not to block the operation instruction for the current tensor data when that instruction and the operation instruction for the previous tensor data are the same descriptor management instruction of a specified type; otherwise, the operation instruction for the current tensor data is blocked or cached.
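By way of illustration only, the following sketch combines the two rules, reusing the hypothetical Op, Instr, and DependencyPairTable types from the earlier sketches.

```cpp
// Specified types of descriptor management instruction, per the text.
bool is_specified_descriptor_op(Op op) {
    return op == Op::CreateDescriptor || op == Op::ReleaseDescriptor;
}

// Called when the current tensor depends on a prior tensor. Returns true if
// the operation must be blocked (or cached) until table space frees up.
bool must_block(const Instr& current, const Instr& prior,
                DependencyPairTable& table) {
    // Same specified type of descriptor management instruction: no blocking.
    if (current.op == prior.op && is_specified_descriptor_op(current.op)) {
        return false;
    }
    // Otherwise the dependency must be recorded; an already-recorded pair
    // needs no new slot, while a full table forces a block until space is
    // freed (e.g. via DependencyPairTable::erase).
    if (table.dependent(current.tensor_id, prior.tensor_id)) return false;
    return !table.record(current.tensor_id, prior.tensor_id);
}
```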
Although the decoding unit 310, the dependency check unit 320, the instruction issue unit 330, and the processing unit 350 are illustrated as separate modules in FIG. 3, those skilled in the art will appreciate that these units may also be implemented as a single module, or recombined or split into more modules, and the present disclosure is not limited in this respect.
The instruction processing apparatus 300 may be implemented using a general-purpose processor (e.g., a central processing unit (CPU) or a graphics processing unit (GPU)) and/or a special-purpose processor (e.g., an artificial intelligence processor, a scientific computing processor, or a digital signal processor), and the present disclosure is not limited to a particular type of instruction processing apparatus.
FIG. 5 illustrates an exemplary flow diagram of an instruction processing method 500 according to an embodiment of the disclosure. Instruction processing method 500 may be implemented, for example, by instruction processing apparatus 300 of FIG. 3.
As shown in FIG. 5, the method 500 begins in step S510 by decoding an instruction to obtain an opcode and an operand of the instruction. This step may be performed, for example, by decode unit 310 of fig. 3.
Next, in step S520, it is determined whether a dependency exists between the two decoded instructions. This step may be performed, for example, by the dependency check unit 320 of fig. 3.
In some implementations, it may be determined whether a dependency exists between two instructions by looking up a pre-stored dependency table. The dependency table may record dependencies between tensor data such that the dependency table may be looked up to determine dependencies between instructions based on tensor data involved in operands of the instructions.
As previously described, the dependency table may be maintained in the storage unit in the form of various data structures. In some implementations, the dependency table records the dependencies between tensor data in a matrix form. In other implementations, the dependency table records the dependent tensor data pairs in a configuration table. The dependency relationship between the tensor data may be recorded in the dependency relationship table at the same time as or after the tensor data is created.
Finally, in step S530, it is determined whether the two instructions can be executed out of order according to the types and dependencies of the opcodes of the two instructions.
In some implementations, determining whether the two instructions can be executed out of order may include: when both instructions are descriptor management instructions of a specified type, determining that the two instructions can be executed out of order, where a descriptor is used to indicate information of tensor data. The specified type of descriptor management instruction may include any of: a descriptor creation instruction and a descriptor release instruction. Optionally, determining whether the two instructions can be executed out of order further includes: when the two instructions are not both descriptor management instructions of a specified type and a dependency exists between them, determining that the two instructions require order-preserving execution. Optionally, determining whether the two instructions can be executed out of order further includes: when no dependency exists between the two instructions, determining that the two instructions can be executed out of order.
FIG. 6A illustrates an exemplary flowchart for determining whether instructions can be executed out of order according to an embodiment of the disclosure. As shown in FIG. 6A, after the opcode types of the two instructions are obtained, it is determined in step S610 whether both opcodes are of the specified type. If so, the flow proceeds directly to S612, where it is determined that out-of-order execution is possible. Otherwise, in step S614, it is further determined whether a dependency exists between the instructions. If no dependency exists, the flow jumps to S612 and out-of-order execution is determined to be possible; otherwise, in step S616, it is determined that out-of-order execution is not possible, and the corresponding order-preserving operation may then be initiated, as described above with reference to FIG. 3 and not repeated here.
FIG. 6B illustrates an exemplary flowchart for determining whether instructions can be executed out of order according to another embodiment of the disclosure. As shown in FIG. 6B, in step S620 it is determined whether a dependency exists between the instructions, for example by looking up the dependency table. If no dependency exists, the flow proceeds directly to S622, where it is determined that out-of-order execution is possible. Otherwise, in step S624, it is further determined whether the opcodes of the instructions are both of the specified type. If so, the flow jumps to S622 and out-of-order execution is determined to be possible; otherwise, in step S626, it is determined that out-of-order execution is not possible, and the corresponding order-preserving operation may then be initiated.
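By way of illustration only, the following sketch follows the decision order of FIG. 6A, reusing the hypothetical Instr, is_specified_descriptor_op, and DependencyPairTable from the sketches above, and reading "both opcodes of the specified type" as the same specified type, consistent with the example of instruction stream 420. FIG. 6B applies the same two tests in the opposite order and reaches the same conclusion.

```cpp
// Decision of Fig. 6A: opcode-type test first, then the dependency lookup.
bool can_execute_out_of_order(const Instr& a, const Instr& b,
                              const DependencyPairTable& table) {
    // S610: both opcodes are the same specified type -> out of order (S612).
    if (a.op == b.op && is_specified_descriptor_op(a.op)) return true;
    // S614: no dependency between the operand tensors -> out of order (S612).
    if (!table.dependent(a.tensor_id, b.tensor_id)) return true;
    // S616: otherwise the original order must be preserved.
    return false;
}
```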
The instruction processing method executed by the instruction processing apparatus of the embodiments of the present disclosure has been described above with reference to the flowcharts. It is noted that, while for simplicity of explanation the foregoing method embodiments are described as a series of acts or combinations of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of the acts described, as some steps may, in accordance with the present disclosure, be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary, and that the acts and modules involved are not necessarily required by the disclosure.
It is further noted that, although the steps in the flowcharts of FIGS. 5, 6A, and 6B are shown in the sequence indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 5, 6A, and 6B may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Fig. 7 is a block diagram illustrating a combined processing device 700 according to an embodiment of the present disclosure. As shown in fig. 7, the combined processing device 700 includes a computing processing device 702, an interface device 704, other processing devices 706, and a storage device 708. Depending on the application scenario, the computing processing device 702 may include one or more computing devices 710, each of which may be configured as the instruction processing apparatus 300 shown in fig. 3 to perform the operations described herein in conjunction with figs. 5, 6A, and 6B.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to jointly complete user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and artificial intelligence processors. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, and discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure alone may be considered to have a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, they may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial-intelligence computing device, e.g., one related to neural network operations) and external data and control, performing basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing processing device may obtain input data from other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained in the internal or on-chip storage of the computing processing device or other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 802 shown in fig. 8). In one implementation, the chip is a system on chip (SoC) integrating one or more combined processing devices as shown in fig. 7. The chip may be connected to other associated components through an external interface device (such as the external interface device 806 shown in fig. 8). The associated component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below with reference to fig. 8.
Fig. 8 is a schematic diagram illustrating the structure of a board card 800 according to an embodiment of the disclosure. As shown in FIG. 8, the board card includes a storage device 804 for storing data, which includes one or more storage units 810. The storage device may be connected to the control device 808 and the above-described chip 802 via, for example, a bus, for data transfer. Further, the board card includes an external interface device 806 configured for data relay or transfer between the chip (or a chip in the chip package structure) and an external device 812 (e.g., a server or a computer). For example, data to be processed may be transferred to the chip by the external device through the external interface device. For another example, a computation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may take different interface forms, for example, a standard PCIE interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a microcontroller unit (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 7 and 8, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet-of-things terminal, a mobile terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described, and certain steps may be performed in other orders or simultaneously in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the acts and modules involved are not necessarily required to implement a particular solution of the present disclosure. In addition, depending on the solution, the description of some embodiments in the present disclosure may have its own emphasis. In view of this, those skilled in the art will understand that portions not described in detail in one embodiment of the present disclosure may also be found in the descriptions of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage media or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
Clause 1. an instruction processing apparatus comprising:
a decoding unit configured to decode an instruction to obtain an operation code and an operand of the instruction;
a dependency check unit configured to determine whether a dependency exists between two decoded instructions; and
an instruction transmitting unit configured to determine whether the two instructions can be executed out of order according to the operation code types of the two instructions and the dependency relationship.
Clause 2. the instruction processing apparatus according to clause 1, wherein the instruction transmitting unit is further configured to:
determine that the two instructions can be executed out of order when both instructions are descriptor management instructions of a specified type, wherein a descriptor is used to indicate information of tensor data.
Clause 3. the instruction processing apparatus according to any of clauses 1-2, wherein the instruction transmitting unit is further configured to:
determine that the two instructions require order-preserving execution when the two instructions are not both descriptor management instructions of a specified type and a dependency exists between the two instructions, wherein a descriptor is used to indicate information of tensor data.
Clause 4. the instruction processing apparatus according to any of clauses 2-3, wherein the descriptor management instruction of the specified type comprises any of: a descriptor creation instruction and a descriptor release instruction.
Clause 5. the instruction processing apparatus according to any of clauses 1-4, wherein the instruction transmitting unit is further configured to:
determine that the two instructions can be executed out of order when no dependency exists between the two instructions.
Clause 6. the instruction processing apparatus according to any of clauses 1 to 5, further comprising a storage unit configured to store a dependency table, wherein the dependency check unit is further configured to:
determine whether the dependency exists between the two instructions by looking up the dependency table pre-stored in the storage unit.
Clause 7. the instruction processing apparatus according to clause 6, wherein the dependency table records dependencies between tensor data, the dependency check unit being further configured to:
look up the dependency table to determine the dependency based on the tensor data involved in the operands of the instructions.
Clause 8. the instruction processing apparatus of clause 7, wherein:
the dependency relationship table records the dependency relationship among tensor data in a matrix form; or
The dependency relationship table records tensor data pairs with dependency relationships in a configuration table form.
Clause 9. the instruction processing apparatus according to any one of clauses 7 to 8, wherein the dependency relationship between the tensor data is recorded in the dependency relationship table at the same time as or after the tensor data is created.
Clause 10. the instruction processing apparatus of clause 9, further comprising a processing unit configured to:
determining dependencies between the tensor data; and
recording the dependency in the dependency table.
Clause 11. the instruction processing apparatus of clause 10, wherein the processing unit is further configured to determine the dependency between the tensor data in any one of the following ways:
determining whether a dependency exists between the tensor data according to indication information from the software side; or
determining whether a dependency exists between the tensor data by judging whether the address spaces corresponding to the tensor data overlap.
Clause 12. the instruction processing apparatus of any of clauses 10-11, wherein the processing unit is further configured to:
selectively blocking operations on the current tensor data until the dependency table has available space when it is determined that there is a dependency between the current tensor data and a previous tensor data but the dependency table has no available space.
Clause 13. the instruction processing apparatus of clause 12, wherein the processing unit is further configured to selectively block operations on the current tensor data as follows:
when the operation instruction for the current tensor data and the operation instruction for the previous tensor data are the same descriptor management instruction of a specified type, not blocking the operation instruction for the current tensor data; otherwise, blocking or caching the operation instruction for the current tensor data.
Clause 14. a chip, characterized in that it comprises an instruction processing device according to any of clauses 1 to 13.
Clause 15. a board, wherein the board includes the chip of clause 14.
Clause 16. an instruction processing method, comprising:
decoding an instruction to obtain an operation code and an operand of the instruction;
determining whether a dependency exists between two decoded instructions; and
determining whether the two instructions can be executed out of order according to the operation code types of the two instructions and the dependency relationship.
Clause 17. the method of instruction processing according to clause 16, wherein said determining whether out-of-order execution is possible between the two instructions comprises:
determining that the two instructions can be executed out of order when both instructions are descriptor management instructions of a specified type, wherein a descriptor is used to indicate information of tensor data.
Clause 18. the method of instruction processing according to any of clauses 16-17, wherein the determining whether out-of-order execution is possible between the two instructions further comprises:
determining that the two instructions require order-preserving execution when the two instructions are not both descriptor management instructions of a specified type and a dependency exists between the two instructions, wherein a descriptor is used to indicate information of tensor data.
Clause 19. the instruction processing method according to any of clauses 17-18, wherein the descriptor management instruction of the specified type comprises any of: a descriptor creation instruction and a descriptor release instruction.
Clause 20. the method of instruction processing according to any of clauses 16-19, wherein the determining whether out-of-order execution is possible between the two instructions further comprises:
determining that the two instructions are capable of out-of-order execution when no dependency exists between the two instructions.
Clause 21. the method of instruction processing according to any of clauses 16-20, wherein the determining whether a dependency exists between two decoded instructions comprises:
determining whether the dependency exists between the two instructions by looking up a pre-stored dependency table.
Clause 22. the instruction processing method of clause 21, wherein the dependency table records dependencies between tensor data, and the determining whether a dependency exists between two decoded instructions further comprises:
looking up the dependency table to determine the dependencies based on tensor data involved in operands of the instruction.
Clause 23. the instruction processing method according to clause 22, wherein:
the dependency relationship table records the dependency relationship among tensor data in a matrix form; or
The dependency relationship table records tensor data pairs with dependency relationships in a configuration table form.
Clause 24. the instruction processing method according to any of clauses 22-23, wherein the dependency relationship between the tensor data is recorded in the dependency relationship table at the same time as or after the tensor data is created.

Claims (24)

1. An instruction processing apparatus comprising:
a decoding unit configured to decode an instruction to obtain an operation code and an operand of the instruction;
a dependency check unit configured to determine whether a dependency exists between two decoded instructions; and
an instruction issue unit configured to determine whether the two instructions can be executed out of order according to the operation code types of the two instructions and the dependency relationship.
2. The instruction processing apparatus of claim 1, wherein the instruction issue unit is further configured to:
determine that the two instructions can be executed out of order when both instructions are descriptor management instructions of a specified type, wherein a descriptor is used to indicate information of tensor data.
3. The instruction processing apparatus according to any of claims 1-2, wherein the instruction issue unit is further configured to:
determine that the two instructions require order-preserving execution when the two instructions are not both descriptor management instructions of a specified type and a dependency exists between the two instructions, wherein a descriptor is used to indicate information of tensor data.
4. The instruction processing apparatus according to any one of claims 2 to 3, wherein the descriptor management instruction of the specified type comprises any one of: a descriptor creation instruction and a descriptor release instruction.
5. The instruction processing apparatus according to any of claims 1-4, wherein the instruction issue unit is further configured to:
determine that the two instructions can be executed out of order when no dependency exists between the two instructions.
6. The instruction processing apparatus according to any of claims 1-5, further comprising a storage unit configured to store a dependency table, wherein the dependency check unit is further configured to:
determine whether the dependency exists between the two instructions by looking up the dependency table pre-stored in the storage unit.
7. The instruction processing apparatus according to claim 6, wherein the dependency table records dependencies between tensor data, the dependency check unit being further configured to:
look up the dependency table to determine the dependency based on the tensor data involved in the operands of the instructions.
8. The instruction processing apparatus of claim 7, wherein:
the dependency relationship table records the dependency relationship among tensor data in a matrix form; or
The dependency relationship table records tensor data pairs with dependency relationships in a configuration table form.
9. The instruction processing apparatus according to any one of claims 7 to 8, wherein the dependency relationships between the tensor data are recorded in the dependency relationship table at the same time as or after creation of the tensor data.
10. The instruction processing apparatus of claim 9, further comprising a processing unit configured to:
determining dependencies between the tensor data; and
recording the dependency in the dependency table.
11. The instruction processing device of claim 10, wherein the processing unit is further configured to determine dependencies between the tensor data in any one of:
determining whether a dependency exists between the tensor data according to indication information from the software side; or
determining whether a dependency exists between the tensor data by judging whether the address spaces corresponding to the tensor data overlap.
12. The instruction processing apparatus according to any of claims 10-11, wherein the processing unit is further configured to:
when it is determined that there is a dependency between current tensor data and prior tensor data but the dependency table has no available space, selectively blocking operations on the current tensor data until the dependency table has available space.
13. The instruction processing apparatus of claim 12, wherein the processing unit is further configured to selectively block operations on the current tensor data as follows:
when the operation instruction for the current tensor data and the operation instruction for the previous tensor data are the same descriptor management instruction of a specified type, not blocking the operation instruction for the current tensor data; otherwise, blocking or caching the operation instruction for the current tensor data.
14. A chip, characterized in that it comprises an instruction processing device according to any one of claims 1-13.
15. A card comprising the chip of claim 14.
16. An instruction processing method, comprising:
decoding an instruction to obtain an operation code and an operand of the instruction;
determining whether a dependency exists between two decoded instructions; and
determining whether the two instructions can be executed out of order according to the operation code types of the two instructions and the dependency relationship.
17. The instruction processing method of claim 16, wherein the determining whether out-of-order execution is possible between the two instructions comprises:
determining that the two instructions can be executed out of order when both instructions are descriptor management instructions of a specified type, wherein a descriptor is used to indicate information of tensor data.
18. The method of any of claims 16-17, wherein the determining whether out-of-order execution is possible between the two instructions further comprises:
determining that the two instructions require order-preserving execution when the two instructions are not both descriptor management instructions of a specified type and a dependency exists between the two instructions, wherein a descriptor is used to indicate information of tensor data.
19. The instruction processing method according to any of claims 17-18, wherein said descriptor management instruction of a specified type comprises any of: a descriptor creation instruction and a descriptor release instruction.
20. The instruction processing method of any of claims 16-19, wherein said determining whether out-of-order execution is possible between the two instructions further comprises:
determining that the two instructions are capable of out-of-order execution when no dependency exists between the two instructions.
21. The instruction processing method of any of claims 16-20, wherein said determining whether a dependency exists between two decoded instructions comprises:
determining whether the dependency exists between the two instructions by looking up a pre-stored dependency table.
22. The instruction processing method of claim 21, wherein the dependency table records dependencies between tensor data, and the determining whether a dependency exists between two decoded instructions further comprises:
looking up the dependency table to determine the dependencies based on the tensor data involved in the operands of the instructions.
23. The instruction processing method of claim 22, wherein:
the dependency relationship table records the dependency relationship among tensor data in a matrix form; or
The dependency relationship table records tensor data pairs with dependency relationships in a configuration table form.
24. The instruction processing method according to any one of claims 22 to 23, wherein the dependency relationships between the tensor data are recorded in the dependency relationship table at the same time as or after the tensor data are created.