CN115221104A

CN115221104A - Data processing device, data processing method and related product

Info

Publication number: CN115221104A
Application number: CN202110482855.2A
Authority: CN
Inventors: 不公告发明人
Original assignee: Cambrian Jixingge Nanjing Technology Co ltd
Current assignee: Cambrian Jixingge Nanjing Technology Co ltd
Priority date: 2021-04-30
Filing date: 2021-04-30
Publication date: 2022-10-21

Abstract

The present disclosure discloses a data processing apparatus, a data processing method, and a related product. The data processing apparatus may be implemented as a computing apparatus included in a combined processing apparatus, which may also include interface apparatus and other processing apparatus. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. The scheme disclosed by the invention provides a special instruction for data fusion related operation, which can simplify the processing and improve the processing efficiency of the machine.

Description

Data processing device, data processing method and related product

Technical Field

The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board.

Background

In recent years, with the rapid development of deep learning, the performance of algorithms in a series of fields such as computer vision and natural language processing has been developed in a cross-over manner. However, the deep learning algorithm is a calculation-intensive and storage-intensive tool, and with the increasing complexity of information processing tasks and the increasing requirements for the real-time performance and accuracy of the algorithm, the neural network is often designed to be deeper and deeper, so that the requirements for the calculation amount and the storage space are increased, and the existing artificial intelligence technology based on deep learning is difficult to be directly applied to mobile phones, satellites or embedded devices with limited hardware resources.

Therefore, compression, acceleration, optimization of the deep neural network model becomes of great importance. A large number of researches try to reduce the calculation and storage requirements of the neural network on the premise of not influencing the model precision, and have very important significance on the engineering application of the deep learning technology at an embedded end and a mobile end. Thinning is one of the model lightweight methods.

The network parameter sparsification is to reduce redundant components in a larger network by a proper method so as to reduce the requirement of the network on the calculation amount and the storage space. Existing hardware and/or instruction sets may not be able to efficiently support sparsification and/or post-sparsification related processing.

Disclosure of Invention

In order to at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing apparatus, a data processing method, a chip, and a board.

In a first aspect, the present disclosure discloses a data processing apparatus comprising: a control circuit configured to parse a fuse instruction, the fuse instruction indicating to perform a fuse process on multiple paths of data to be fused, and at least one operand of the fuse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data; a tensor interface circuit configured to parse the descriptors; a storage circuit configured to store information before and/or after the fusion process; and the operation circuit is configured to merge data elements in the multiple paths of data to be merged into one path of ordered merged data according to corresponding indexes of the data elements based on the analyzed descriptors and according to the merging instruction, wherein each merged data element is represented by using an operation structure element, and the data elements comprise any one of scalars, vectors or higher-dimensional data.

In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any of the embodiments of the first aspect.

In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.

In a fourth aspect, the present disclosure provides a data processing method comprising: parsing a fuse instruction, the fuse instruction indicating that a fuse process is performed on multiple paths of data to be fused, and at least one operand of the fuse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data; parsing the descriptor; merging the data elements in the multi-path data to be fused into one path of ordered fused data according to corresponding indexes of the data elements based on the analyzed descriptor and the fusion instruction, wherein each fused data element is represented by an operation structure element, and the data elements comprise any one of scalars, vectors or higher-dimensional data; and outputting the fused data.

With the data processing apparatus, the data processing method, the chip and the board provided as above, the embodiments of the present disclosure provide a fused instruction for performing an operation related to merging-sorting fusion of multiple paths of data, wherein at least one operand of the fused instruction is described by a descriptor. In some embodiments, the fused instruction is a hardware instruction, and the data fusion process is implemented by a dedicated hardware circuit. In some embodiments, the data processing apparatus may merge multiple paths of ordered data into one path of ordered merged data according to the merging instruction in an index order, and data elements in the merged data may be represented in the form of an operation structure, thereby facilitating subsequent calculation processing.

In some embodiments, an operation mode bit may be included in the fused instruction to indicate that the fused instruction is a merge-sort fusion process, or the fused instruction itself may indicate a merge-sort fusion process operation.

In some embodiments, the data elements to be fused may be vectors or higher dimensional data. For example, the data elements in the data to be fused may be thinned-out valid data elements in radar-based object detection, so that the data fusion instructions and fusion operations provided by embodiments of the present disclosure may support relevant processing in radar algorithms.

By providing a special fusing instruction to perform an operation related to the fusing processing of the multiplexed data, the processing can be simplified. Further, by providing a hardware implementation of specialized data fusion related operations, processing may be accelerated, thereby increasing the processing efficiency of the machine.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:

fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;

FIG. 2 is a block diagram illustrating a combined processing device of an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating the internal structure of a processor core of a single or multi-core computing device of an embodiment of the present disclosure;

FIG. 4 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure;

FIG. 6 illustrates an exemplary schematic diagram of a data fusion process according to an embodiment of the present disclosure;

FIG. 7 shows a schematic block diagram of a data processing apparatus according to an embodiment of the disclosure;

FIG. 8 illustrates an exemplary circuit diagram for a data fusion process according to one embodiment of the present disclosure;

FIG. 9 illustrates the indication of various descriptors in a fused instruction; and

FIG. 10 illustrates an exemplary flow chart of a data processing method according to an embodiment of the disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be understood that the terms "first," "second," "third," and "fourth," etc. as may appear in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when.. Or" once "or" in response to a determination "or" in response to a detection ".

Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a Chip 101, which is a System on Chip (SoC), or System on Chip, integrated with one or more combined processing devices, which is an artificial intelligence arithmetic unit for supporting various deep learning and machine learning algorithms and meeting the intelligent processing requirements in the fields of computer vision, speech, natural language processing, data mining and the like under complex scenes. Especially, the deep learning technology is widely applied to the field of cloud intelligence, and one remarkable characteristic of the cloud intelligence application is that the input data size is large, and the requirements on the storage capacity and the computing capacity of a platform are high.

The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred to the chip 101 by the external device 103 through the external interface apparatus 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.

The card 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected and data-transferred with the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 is configured to regulate the state of the chip 101. For this purpose, in an application scenario, the control device 106 may include a single chip Microcomputer (MCU).

Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.

The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.

The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on-chip with the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.

The processing device 203, as a general purpose processing device, performs basic control including, but not limited to, data handling, starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of Central Processing Unit (CPU), graphics Processing Unit (GPU) or other general purpose and/or special purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be viewed as having a single core structure or an isomorphic multi-core structure only. However, when considered collectively, the computing device 201 and the processing device 203 are considered to form a heterogeneous multi-core structure.

The storage device 204 is used to store data to be processed, which may be a DRAM, a DDR memory, and is typically 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.

Fig. 3 shows an internal structure diagram of a processor core when the computing device 201 is a single-core or multi-core device. The computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the computing device 301 includes three major modules: a control module 31, an arithmetic module 32 and a storage module 33.

The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operation, and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.

The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a Direct Memory Access (DMA) 333. The NRAM 331 is used for storing input neurons, output neurons and intermediate results after calculation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.

Instructions of conventional processors are designed to perform basic single data scalar operations. Here, a single data scalar operation refers to an instruction where each operand is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the oriented operands are often data types of multidimensional vectors (i.e., tensor data), and the use of only scalar or vector operations does not enable hardware to efficiently complete the computational task. Thus, in embodiments of the present disclosure, fusion instructions involving tensor data are provided. At least one operand of the fusion instruction includes tensor data, the tensor data being indicated by at least one descriptor. In particular, the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data. Shape information of the tensor data can be used to determine the data address of the tensor data corresponding to the operand in the data storage space. The spatial information of the tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the order of execution of the instructions.

In one possible implementation, the spatial information of the tensor data may be indicated by a spatial Identification (ID). The spatial ID may also be referred to as a spatial alias, which refers to a spatial region for storing corresponding tensor data, and the spatial region may be a continuous space or a multi-segment space. Different spatial IDs indicate that there is no dependency on the spatial region pointed to.

Various possible implementations of shape information of tensor data are described in detail below in conjunction with the figures.

Tensors may contain multiple forms of data composition. The tensors may be of different dimensions, e.g. a scalar may be regarded as a 0-dimensional tensor, a vector may be regarded as a 1-dimensional tensor, and a matrix may be a 2-or higher-than-2-dimensional tensor. The shape of the tensor includes information such as the dimensions of the tensor, the sizes of the dimensions of the tensor, and the like. For example, for a three-dimensional tensor:

x ₃ =[[[1，2，3，[4，5，6]]；[[7，8，9]，[10，11，12]]]

the shape or dimension of the tensor can be expressed as X ₃ = (2,2,3), i.e. the tensor is represented by three parameters as a three-dimensional tensor with a size of the first dimension of 2, a size of the second dimension of 2 and a size of the third dimension of 3. When storing tensor data in a memory, the shape of the tensor data cannot be determined according to the data address (or the storage area) of the tensor data, and further, relevant information such as the interrelation among a plurality of tensor data cannot be determined, which results in low efficiency of access of the processor to the tensor data.

In one possible implementation, the shape of the N-dimensional tensor data may be indicated with a descriptor, N being a positive integer, e.g. N =1, 2 or 3, or zero. The three-dimensional tensor in the above example can be represented by a descriptor (2, 3). It should be noted that the present disclosure is not limited to the way the descriptors indicate the tensor shape.

In one possible implementation, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.

Although tensor data can be multidimensional, there is a correspondence between tensors and storage on memory because the layout of memory is always one-dimensional. Tensor data is typically allocated in contiguous memory space, i.e., the tensor data can be one-dimensionally expanded (e.g., line first) for storage on memory.

This relationship between the tensor and the underlying storage may be represented by an offset of a dimension (offset), a size of a dimension (size), a step size of a dimension (stride), and so on. The offset of a dimension refers to the offset in that dimension from a reference position. The size of a dimension refers to the size of the dimension, i.e., the number of elements in the dimension. The step size of a dimension refers to the interval between adjacent elements in that dimension, e.g., the step size of the above three-dimensional tensor is (6, 3, 1), i.e., the step size of the first dimension is 6, the step size of the second dimension is 3, and the step size of the third dimension is 1.

FIG. 4 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in fig. 4, the data storage space 41 stores a two-dimensional data in a line-first manner, which can be represented by (X, Y) (wherein the X-axis is horizontally to the right and the Y-axis is vertically to the bottom). The size in the X-axis direction (the size of each row, or the total number of columns) is ori _ X (not shown), the size in the Y-axis direction (the total number of rows) is ori _ Y (not shown), and the starting address PA _ start (the base address) of the data storage space 41 is the physical address of the first data block 42. The data block 43 is partial data in the data storage space 41, and its offset amount 45 in the X-axis direction is denoted as offset _ X, the offset amount 44 in the Y-axis direction is denoted as offset _ Y, the size in the X-axis direction is denoted as size _ X, and the size in the Y-axis direction is denoted as size _ Y.

In a possible implementation manner, when the descriptor is used to define the data block 43, the data reference point of the descriptor may use the first data block of the data storage space 41, and the reference address of the descriptor may be the starting address PA _ start of the data storage space 41. The content of the descriptors of data block 43 may then be determined in combination with the size ori _ X of data storage space 41 in the X-axis, the size ori _ Y in the Y-axis, and the offset amount offset _ Y of data block 103 in the Y-axis direction, the offset amount offset _ X in the X-axis direction, the size _ X in the X-axis direction, and the size _ Y in the Y-axis direction.

In one possible implementation, the content of the descriptor can be represented using the following formula (1):

it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.

In one possible implementation manner, a reference address of the data reference point of the descriptor in the data storage space may be appointed, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions of at least two vertexes located at diagonal positions in the N dimensional directions relative to the data reference point.

For example, a reference address PA _ base of a data reference point of the descriptor in the data storage space may be agreed. For example, one data (for example, data with a position of (2, 2)) may be selected as a data reference point in the data storage space 41, and the physical address of the data in the data storage space may be used as the reference address PA _ base. The content of the descriptor of the data block 43 in fig. 4 can be determined from the positions of the two vertices of the diagonal positions with respect to the data reference point. First, the positions of at least two vertices of the diagonal positions of the data block 43 relative to the data reference point are determined, for example, the positions of the diagonal position vertices relative to the data reference point in the top-left to bottom-right direction are used, wherein the relative position of the vertex at the top-left corner is (x _ min, y _ min), and the relative position of the vertex at the bottom-right corner is (x _ max, y _ max), and then the content of the descriptor of the data block 43 can be determined according to the reference address PA _ base, the relative position of the vertex at the top-left corner (x _ min, y _ min), and the relative position of the vertex at the bottom-right corner (x _ max, y _ max).

In one possible implementation, the content of the descriptor (with reference to PA _ base) can be represented using the following equation (2):

it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.

In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional spatial data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).

In one possible implementation, the content of the descriptor can be represented using the following equation (3):

in one possible implementation, the descriptor is further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example the content of the descriptor may be the following equation (4):

where PA is the address parameter. The address parameter may be a logical address or a physical address. When the descriptor is analyzed, the PA may be used as any one of a vertex, a middle point, or a preset point of the vector shape, and the corresponding data address may be obtained by combining the shape parameters in the X direction and the Y direction.

In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address includes a start address of the data storage space.

In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be the following equation (5):

wherein PA _ start is a reference address parameter, which is not described again.

It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.

In a possible implementation manner, a default base address may be set in a task, the base address is used by descriptors in instructions under the task, and the descriptor content may include shape parameters based on the base address. This base address may be determined by setting environmental parameters for the task. The relevant description and usage of the base address can be found in the above embodiments. In this implementation, the contents of the descriptor can be mapped to the data address more quickly.

In a possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the mode of setting a common reference address by using the environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.

In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.

For example, the content of the descriptor in the operand is expressed by formula (1), the amounts of shift of the tensor data indicated by the descriptor in the data storage space are offset _ x and offset _ y, respectively, and the size is size _ x × size _ y, then the starting data address PA1 of the tensor data indicated by the descriptor in the data storage space _(x,y) The following equation (6) may be used to determine:

PA1 _(x,y) ＝PA_start+(offset_y-1)*ori_x+offset_x (6)

the data start address PA1 determined according to the above formula (6) _(x,y) In combination with the offsets offset _ x and offset _ y and the sizes size _ x and size _ y of the storage area, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.

In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.

For example, the content of the descriptor in the operand is expressed using formula (2)The tensor data indicated by the descriptor are respectively offset in the data storage space by offset _ x and offset _ y, the size is size _ x × size _ y, and the data description position for the descriptor included in the operand is (x) _q ，y _q ) Then, the data address PA2 of the tensor data indicated by the descriptor in the data storage space _(x,y) The following equation (7) may be used to determine:

PA2 _(x,y) ＝PA_start+(offset_y+y _q -1)*ori_x+(offset_x+x _q ) (7)

in one possible implementation, the descriptor may indicate the data of the block. The data partitioning can effectively accelerate the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data partitioning for fast arithmetic processing.

FIG. 5 shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure. As shown in FIG. 5, the data storage space 500 stores two-dimensional data, also in a row-first manner, which may be represented by (X, Y) (where the X-axis is horizontally to the right and the Y-axis is vertically down). The dimension in the X-axis direction (the dimension of each row, or the total number of columns) is ori _ X (not shown), and the dimension in the Y-axis direction (the total number of rows) is ori _ Y (not shown). Unlike the tensor data of fig. 4, the tensor data stored in fig. 5 includes a plurality of data blocks.

In this case, the descriptor requires more parameters to represent the data chunks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori _ x, x.tile.size (size in tile 502), x.tile.stride (step size in tile 504, i.e. the distance between the first point of the first tile and the first point of the second tile), x.tile.num (number of tiles, shown as 3 tiles in the figure), x.stride (overall step size, i.e. the distance from the first point of the first row to the first point of the second row), etc. Other dimensions may similarly include corresponding parameters.

In one possible implementation, the descriptor may include an identification of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be its number; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, the tensor data is 3-dimensional data, of three dimensions of the tensor data, in which shape parameters of two dimensions are fixed, the content of the descriptor thereof may include a shape parameter representing another dimension of the tensor data.

In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific location of the descriptor storage space and the data storage space.

In one possible implementation, the identity, content, and tensor data indicated by the descriptors may be stored in the same block of internal memory, e.g., a contiguous block of on-chip cache may be used to store the relevant content of the descriptors at ADDR0-ADDR1023. Here, the addresses ADDR0 to ADDR63 may be used as a descriptor storage space to store the identifier and the content of the descriptor, and the addresses ADDR64 to ADDR1023 may be used as a data storage space to store tensor data indicated by the descriptor. In the descriptor memory space, the identifiers of the descriptors may be stored with addresses ADDR0-ADDR31, and addresses ADDR32-ADDR 63. It should be understood that the address ADDR is not limited to 1 bit or one byte, and is used herein to mean one address, which is a unit of one address. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited thereto.

In one possible implementation, the identity of the descriptors, the content, and the tensor data indicated by the descriptors may be stored in different areas of internal memory. For example, a register may be used as a descriptor storage space, the identifier and the content of the descriptor may be stored in the register, an on-chip cache may be used as a data storage space, and tensor data indicated by the descriptor may be stored.

In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area in the buffer space can be allocated for storing the tensor data according to the size of the tensor data indicated by the descriptor.

In one possible implementation, the identity and content of the descriptors may be stored in an internal memory and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identification and content of the descriptors can be stored on-chip, and the tensor data indicated by the descriptors can be stored off-chip.

In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, separate data storage spaces may be divided for tensor data, each of which has a one-to-one correspondence with descriptors at the start address of the data storage space. In this case, a circuit or module (e.g., an entity external to the disclosed computing device) responsible for parsing the computation instruction may determine the data address in the data storage space of the data corresponding to the operand from the descriptor.

In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may be further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor may further include at least one address parameter indicating the address of the tensor data. For example, the tensor data is 3-dimensional data, when the descriptor points to the address of the tensor data, the content of the descriptor may include one address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data, or may include a plurality of address parameters of the address of the tensor data, such as the starting address of the tensor data + the address offset, or the address parameters of the tensor data based on each dimension. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.

In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a variation of the data reference point. The present disclosure does not limit the selection of data reference points.

In one possible implementation, the base address may comprise a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.

In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one direction of the N dimensional directions, the size of the storage area in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).

It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.

Embodiments of the present disclosure provide a data processing apparatus that performs an operation related to data fusion of tensor data according to a special fusion instruction based on the aforementioned hardware environment. As mentioned in the background, the sparsification of network parameters can effectively reduce the computational and storage requirements of the network. But the subsequent processing can also be influenced by the thinning of the network parameters. For example, in sparse matrix multiplication operations, ordered accumulation of vectors derived from the middle of the operation may be required to obtain the desired operation result. For another example, in the radar algorithm, it is necessary to perform fusion processing on data thinned out in object detection by radar. In view of this, embodiments of the present disclosure provide a hardware solution for data fusion processing to simplify and speed up such processing.

FIG. 6 illustrates exemplary principles of a data fusion process according to an embodiment of the present disclosure. The figure exemplarily shows 4 ways of data to be fused, and each way of data comprises 6 data elements. The data elements may be scalars, vectors, or higher dimensional tensors. The data elements are shown as vectors, e.g., D11, D12, \8230; D46. These vectors have a uniform vector length, e.g., D11 is (D1, D2, D3, \ 8230;, dn) and the length is n. Each data element has an associated index for indicating location information of the data element in a corresponding way of data. For example, the original path of data may include 1000 data elements, but only some data elements at positions are valid, and at this time, the valid elements may be extracted to form the data to be fused, and indexes corresponding to the valid elements are extracted to indicate their positions in the original data, where the indexes form the indexes to be fused.

The figure schematically shows 4 corresponding ways of indexes to be fused, and each way of index corresponds to one way of data to be fused. The 1 st way index is used for identifying the position information of each data element in the 1 st way data, the 2 nd way index is used for identifying the position information of each data element in the 2 nd way data, and so on. Furthermore, the index elements in each path of index are stored in order and correspond to the data elements in the corresponding path of data one to one. In the example in the figure, the index elements in each way of index are arranged according to a first order (for example, from small to large), and the data elements in each way of data are also arranged in order according to the order of the corresponding index. For example, the 1 st index element in the 1 st way index indicates that the index of the 1 st data element in the 1 st way data is 0, i.e. the first element; the 2 nd index element in the 1 st way index indicates that the index of the 2 nd data element in the 1 st way data is 2, namely, the 3 rd element; and so on.

After data fusion, the 4 paths of data are merged into one path of ordered fusion data according to the corresponding indexes, and data elements with the same index are merged into one fusion data element. As shown, the merged index includes 16 index elements arranged in a second order (e.g., a small to large order), wherein duplicate index elements in the index to be merged are removed, as indicated by the dark squares in the figure. Correspondingly, the fused data also comprises 16 data elements which are arranged in order according to the corresponding indexes, and the data elements with the same indexes are merged, as shown by dark squares in the figure. Since the data elements may be vectors or tensors of higher dimensions, in some embodiments of the present disclosure, at least for the merging of data elements with the same index, the representation may be in the form of an operational structure element.

In this example, the merging of data elements with the same index is schematically represented by an addition formula. For example, for the post-fusion index element "0", the corresponding fused data element is the sum of the first data elements of the respective ways of data (D11 + D21+ D31+ D41). For another example, for the fused index element "9", the fused data element corresponding to the fused index element is the sum (D15 + D43) of the 5 th data element of the 1 st way and the 3 rd data element of the 4 th way. When the data elements are vectors, the fused data elements are the corresponding vector sums. The specific representation of the fused data elements will be described later.

It will be appreciated by those skilled in the art that the first and second sequences referred to above may or may not be the same and both may be selected from any of the following: in order of small to large, or in order of large to small. It will also be appreciated by those skilled in the art that although the figures illustrate each way of data to have an equal number of data elements, the number of data elements in each way of data may be the same or different, and the disclosure is not limited in this respect.

Fig. 7 shows a block diagram of a data processing device 700 according to an embodiment of the present disclosure. The data processing apparatus 700 may be implemented, for example, in the computing apparatus 201 of fig. 2. As shown, the data processing apparatus 700 may include a control circuit 710, a storage circuit 720, an arithmetic circuit 730, and a tensor interface circuit 712.

The control circuit 710 may function similar to the control module 31 of fig. 3, and may include, for example, an instruction fetch unit to fetch an instruction from, for example, the processing device 203 of fig. 2, and an instruction decode unit to decode the fetched instruction and send the decoded result as control information to the operation circuit 730 and the storage circuit 720.

In one embodiment, the control circuit 710 may be configured to parse a fuse instruction, wherein the fuse instruction instructs a fusing process to be performed on the plurality of paths of data to be fused.

The memory circuit 720 may be configured to store various information including at least information before and/or after the fusion process. The memory circuit may be, for example, WRAM 332 of fig. 3.

The arithmetic circuitry 730 may be configured to perform corresponding operations according to the fused instruction. Specifically, the operation circuit 730 may merge the data elements in the multiple paths of data to be merged into an ordered path of merged data according to the corresponding indexes, where at least the data elements with the same index are merged into the operation structure element. The data elements may be any of scalar, vector, or higher dimensional data.

In one embodiment, the arithmetic circuit 730 may further include an arithmetic processing circuit (not shown), which may be configured to pre-process data before the arithmetic circuit performs the arithmetic operation or post-process data after the arithmetic operation according to the arithmetic instruction. In some application scenarios, the aforementioned pre-processing and post-processing may, for example, include data splitting and/or data splicing operations.

A Tensor Interface Unit (TIU) 712 may be configured to implement operations associated with the descriptors under control of the control circuit 710. These operations may include, but are not limited to, registration, modification, deregistration, resolution of descriptors; reading and writing descriptor content, etc. The present disclosure does not limit the specific hardware type of tensor interface unit. In this way, the operations associated with the descriptors can be implemented by dedicated hardware, further improving the efficiency of tensor data access.

In some embodiments, tensor interface circuit 712 may be configured to parse shape information of tensor data included in an operand of an instruction to determine a data address in the data storage space of data corresponding to the operand.

Alternatively or additionally, in still other embodiments, tensor interface circuit 712 may be configured to compare spatial information (e.g., spatial IDs) of tensor data included in operands of two instructions to determine dependencies of the two instructions to determine out-of-order execution, synchronization, etc. operations of the instructions.

Although control circuit 710 and tensor interface circuit 712 are shown in fig. 7 as two separate blocks, those skilled in the art will appreciate that these two circuits may also be implemented as one block or more blocks, and the present disclosure is not limited in this respect.

The operational circuitry may be implemented in a variety of ways. FIG. 8 illustrates an exemplary circuit diagram for a data fusion process according to one embodiment of the present disclosure.

As shown, in one embodiment, the memory circuit may be illustratively divided into two parts: a first storage circuit 822 and a second storage circuit 824.

The first storage circuit 822 may be configured to store K ways of data to be fused and K ways of indices corresponding to the K ways of data, K >1. The index elements in the K-way index indicate index information of corresponding data elements in the K-way data, that is, the index elements and the data elements have a one-to-one correspondence relationship. In addition, the index elements of each of the K ways of the index are ordered in a first order, and the data elements of each of the K ways of the data are ordered in an order of the corresponding index.

The figure illustrates the 4-way index shown in figure 6 and the corresponding 4-way data. Since the data elements may be vectors or higher dimensional data, for simplicity, the memory addresses of these data elements may be used instead, with each data element being identified by the symbol Pt in the figure, representing a pointer to the memory address of the corresponding data element (e.g. vector, three-dimensional tensor, etc.). It will be appreciated that specific data elements may also be stored in the memory circuit, which is not shown for clarity. In some embodiments, each way index may be stored contiguously, e.g., as a vector, such that the way index/index vector may be accessed based on the starting address of each way index or the starting address of the vector. Correspondingly, each way of data may also be stored continuously, for example as a vector, so that each way of data/data vector may be accessed according to the start address of the data or the start address of the vector, but in this case, the vector element in the data vector is a pointer to the final data element.

The second storage circuit 824 may be configured to store the fusion-processed data output by the arithmetic circuit. These data may include: and sorting the fused fusion indexes and fusion data of the K-path indexes. Fusion index elements in the fusion index are ordered and arranged according to a second order, fusion data elements in the fusion data correspond to the fusion index elements one to one, and each fusion data element is associated with one or more operation structure elements according to the order ordered and arranged of the fusion index.

As can be seen from the example in the figure, the 4-way data to be fused becomes a way of fused data, and the corresponding 4-way index also becomes a way of fused index, wherein the fused index elements are arranged in the order from small to large, and the index elements with the same size are removed. The corresponding fused data elements are arranged according to the sequence of the fused index, and each fused data element can be an address pointer pointing to the corresponding final data element. In some embodiments, the storage space for each final data element may be a fixed size. Thus, when these data elements are stored consecutively, the subsequent address may be determined by offsetting the first address by a fixed amount. For example, it is exemplarily shown that the 1 st fused data element may be the address base _ addr, and without assuming that the address size occupied by each final data element is offset, the 2 nd fused data element may be the address base _ addr + offset, the 3 rd fused data element may be the address base _ addr +2 × offset, and so on.

Some of the final data elements pointed by the addresses are original data elements because the indexes of the final data elements are unique in the indexes to be fused and do not need to be accumulated; some of them require accumulation because there are multiple original data elements with the same index. In embodiments of the present disclosure, at least for fused data elements composed of data elements having the same index, the final data element to which it points may not be obtained immediately by the operation, but rather represented by the associated operation structure element. The operational structure elements will be described in detail later in conjunction with specific circuits.

In some embodiments, the arithmetic circuitry may include ordering circuitry 832 and output circuitry 836 to cooperatively implement an ordering accumulation fusion function. In particular, the sorting circuit 832 is configured to sort the K-way indices by the size of the index elements and output the sorted K-way indices to the output circuit 836. Output circuitry 836 may then generate, at least when the same index element is received from sorting circuitry, an operation structure element representing an accumulation operation of data elements corresponding to the same index element, and remove duplicate index elements.

In some embodiments, sorting circuit 832 may include a comparison circuit 831 and a buffer circuit 833. The comparison circuit 831 performs a comparison function that compares the sizes of the index elements in the multiple indexes to be fused and submits the comparison result to the control circuit 810 for sorting. The control circuit 810 determines the insertion position of the index element in the buffer circuit 833 according to the comparison result. The buffer circuit 833 is used to buffer the compared index elements and the information of the data elements corresponding to the index elements, and buffer the compared index elements in the size order.

Specifically, the comparing circuit 831 may be configured to compare an index element in the index to be fused with an index element that is not output in the buffer circuit 833, and output a comparison result to the control circuit 810. And the buffer circuit 833 can be configured to store the compared index elements and the information of the data elements corresponding thereto in order and output the compared index elements and the information of the data elements corresponding thereto in order according to the control of the control circuit 810.

In some embodiments, buffer circuit 833 may be configured to buffer K index elements, the K index elements being sorted by size. Those skilled in the art will appreciate that the buffer circuit may also be configured to buffer more index elements, and embodiments of the present disclosure are not limited in this respect. Depending on the ordering in buffer circuit 833 and the ordering desired to be output, such as from small to large, or from large to small, the first index element or the last index element in the current sequence may be output in a specified order each time. For example, in the example in the figure, buffer 833 buffers the index elements from left to right in descending order, and outputs the rightmost index element at a time, that is, the smallest index element in the current sequence, for example, "7".

In these embodiments, the comparing circuit 831 may include a K-1 comparator configured to compare the index elements to be fused with the index elements that are not output in the buffer circuit 833, that is, the index elements that are left after the first or last index element is output in the current sequence, and generate a comparison result and output the comparison result to the control circuit 810.

For example, for 4-way data to be merged, a 3-way comparator is shown that compares the specified index element (9 at this time) received from the first storage circuit 822 with the 3 index elements currently not output in the buffer circuit 833, three

index elements

100,10 and 9 on the left in the figure.

In some embodiments, the comparison results of the comparators may be represented using a bitmap. For example, if the index element to be fused (e.g., 9) is greater than or equal to the index element in the buffer circuit, the comparator may output "1", otherwise "0" is output; the reverse is also possible. In the example in the figure, the comparison result of the index element to be fused (9) with the respective index elements (100, 10, and 9) in the buffer circuit is "001", and is output to the control circuit 810.

The control circuit 810 may be configured to determine, according to the received comparison result, an insertion position of the index element to be fused in the current sequence of the buffer circuit 833. Specifically, the control circuit 810 may be further configured to determine the insertion position according to the change position of the bit in the bitmap. In the example in the figure, the comparison result is "001", which indicates that the index element to be currently fused is smaller than the 1 st and 2 nd index elements from the left in the buffer circuit, and is greater than or equal to the 3 rd index element from the left, and the insertion position is between the 2 nd index element and the 3 rd index element, that is, "10" and "9".

In some embodiments, the buffer circuit 833 may be configured to insert the index element to be fused in the insertion position as instructed by the control circuit 810. In the example in the figure, the sequence after the insertion of this index element in the buffer circuit 833 becomes "100,10, 9".

To enable retrieval of data corresponding to the index during the fusion process, in some embodiments, the buffer circuit 833 may be further configured to: and storing the compared index elements and the data elements corresponding to the index elements in order according to the value sequence of the index elements. As shown, the buffer circuit 833 buffers the index element and also buffers the information of the corresponding data element. Therefore, each time the index element is compared to determine the insertion position, the data element information corresponding to the index element may also be inserted into the cache circuit. Since a data element may be a scalar, vector, or higher-dimensional tensor, the data element may be represented using an address pointing to the data element. For example, in the example in the figure, each element in the K ways of data to be fused is an address, pointing to the corresponding data element, respectively, regardless of whether the data element is a scalar, vector, or high-dimensional tensor. In the description herein, a data element may refer to an address, and may also refer to the final scalar, vector, or higher-dimensional tensor data, the meaning of which may be distinguished by one skilled in the art based on the context description.

Next, the buffer circuit 833 can output the rightmost index element "9". At this time, the control circuit 810 may be further configured to determine, according to the index element output in the buffer circuit, access information of a next index element to be merged. Specifically, the control circuit takes out the next index element to be merged from the way index according to which way index the output index element belongs to in the K way index, and sends the next index element to be merged to the comparison circuit 831 for comparison.

Further, in the case of an ordered output, the buffer circuit 833 may be configured to output the compared index elements in order (for example, from small to large) in the value order of the index elements as a fused index, and synchronously output the data elements corresponding thereto as fused data. The output data is provided to output circuitry 836 for further processing.

For clarity, the figure also shows the index sequence buffered as the sorting progresses in the buffer circuit 833. As shown in the figure, initially, the first index element of each way in the K-way index is stored in the buffer circuit 833 in descending order. In some implementations, these 4 index elements may be fetched, sorted, and stored in buffer circuitry at once. In other implementations, the data in the buffer circuit may be initialized to a negative number, the first index element of each way index may be fetched one by one in order (e.g., in order from way 1 to way 4), compared to the data in the buffer circuit, and put in place. In this example, the first index element of the 4-way index is all 0, and therefore, the index elements can be arranged according to the order of the taking, and the sequence numbers of the way indexes, for example, "0" of the 1 st way is placed at the rightmost side, the "0" of the 2 nd way is placed at the 2 nd position at the right side, and so on.

Then, the rightmost "0" belonging to the 1 st way in the buffer circuit is output. According to which path index the output index element belongs to, the next index element to be fused is taken out from the corresponding path index, namely the 2 nd index element '2' of the 1 st path. "2" is fed to the comparison circuit to be compared with the remaining three "0" s in the buffer circuit, and the comparison result is "111", that is, it is larger than the existing three "0" s in the buffer circuit, so "2" is inserted at the end of the sequence, and the sequence in the buffer circuit becomes "2, 0".

Then, the rightmost "0" belonging to the 2 nd way in the buffer circuit is outputted, so that the 2 nd element "3" of the 2 nd way is taken out and compared with the remaining "2, 0" in the buffer circuit, and the comparison result is "111", so that "3" is inserted at the rearmost end of the sequence, at which time the sequence in the buffer circuit becomes "3,2, 0".

Then, the rightmost "0" belonging to the 3 rd way in the buffer circuit is outputted, so that the 2 nd element "100" of the 3 rd way is taken out and compared with the remaining "2, 0" in the buffer circuit, and the comparison result is "111", so that "100" is inserted at the rearmost end of the sequence, at which time the sequence in the buffer circuit becomes "100,3,2,0".

Then, the rightmost element "0" belonging to the 4 th path of the output buffer circuit is compared with the remaining "100,3,2" in the buffer circuit to extract the 2 nd element "2" in the 4 th path, and the comparison result is "001", so that "2" is inserted after the rightmost element 1 in the sequence, and the sequence in the buffer circuit becomes "100,3, 2".

By analogy, the index elements in the K-way index can be compared one by one, and are inserted into the proper position in the buffer circuit according to the size sequence, and then are output by the buffer circuit. For example, the smallest index element output by the buffer circuit at a time may be output to the output circuit 836 in order. It will be understood by those skilled in the art that if the space of the buffer circuit is sufficient, the merged sorted elements can be uniformly output after the sorting is completed.

As can be seen from the output merged sorted index elements, when there are index elements of the same size, sorting circuitry 832 still retains these index elements of the same size and does not perform deduplication operations, but rather provides them to output circuitry 836 for processing.

In some embodiments, the output circuit 836 may include a comparator 837, a buffer 835, and a structure generator 839.

The comparator 837 may be configured to compare the index element sequentially output from the sorting circuit 832 with the last fused index element and output a comparison result. The comparison results may be "1" indicating the same, "0" indicating different; and vice versa.

The buffer 835 may be configured to control an output index element according to the comparison result of the comparator 837. In some embodiments, buffer 835 may output the current index element as the new fused index element only if the comparison results indicate non-identity. In other words, when the comparison result indicates the same, the buffer 835 does not output the current index element, i.e., discards the index element that is duplicated from the last merged index element. As shown, there are no duplicate fuse index elements of the fuse index in the second storage circuit 824.

The structure generator 839 may be configured to control the accumulation of the data elements according to the comparison result of the comparator 837 to generate corresponding operation structure elements. Specifically, at least when the comparison result indicates the same, an operation structure element representing an accumulation operation of accumulating the current data element to the fused data element corresponding to the last fused index element is generated based on the data element corresponding to the index element.

One way of generating an operational structure element is shown in the embodiment of fig. 8. In this embodiment, in addition to the operation structure element being generated for the fused data element into which the data elements having the same index are fused, the operation structure element is also generated for the other fused data elements. That is, each fused data element is characterized using an operation structure element. Specifically, in such an embodiment, when the comparison result of the comparator 837 indicates that they are not the same, an operation structure element is generated based on the data element corresponding to the current index element, wherein the operation structure element indicates that the data element corresponding to the current index element is added to the data of the specified address, which may be set to a value of 0 in advance. By the method, the operation structure elements can be generated for all the fused data elements, so that the expression mode is unified, the operation is simplified, and flexible processing is provided for subsequent operation.

Fig. 8 illustrates an operation structure generated in the present embodiment. As shown, when the output circuit 836 receives the 1 st "0" output from the sorting circuit 832, since this is the first element, the comparator 837 outputs, for example, "0" to indicate that it is not the same as the last fused index element (e.g., initialized to a negative number). The buffer 835 outputs the index "0" as the first fused index element. The structure generator 839 assigns an address to the new fused data element based on the result of the comparator 837, e.g., the address of the first fused data element is base _ addr. At this time, even if the indices are considered to be different, the associated operation structure element is generated for the fused data element, which is simply an accumulation operation in which the current data element is added to 0. In one embodiment, the operation structure element may indicate an in-situ addition operation, which includes addresses pointing to two addends, e.g., { base _ addr, pt11}, where the data in the address pointed to by base _ addr is preset to a value of 0, and Pt11 represents the address of the current data element, and the result of the addition of the two is still stored in the address pointed to by base _ addr, which corresponds to the address of the first fused data element allocated above. The generated operation structure element may be stored in the second storage circuit 824.

As described above, each of the operation structure elements may include two elements indicating two addend addresses, respectively, for example, { src0_ addr, src1_ addr }, where src0_ addr denotes an address of the 1 st addend, src1_ addr denotes an address of the 2 nd addend, and the addition result is stored in the address pointed to by src0_ addr, that is, the in-situ addition operation is performed.

Then, the sorting circuit 832 outputs the 2 nd "0", and since the last fused index element is "0", the comparator 837 outputs, for example, "1" to indicate the same as the last fused index element ("0"). At this time, buffer 835 does not output. The structure body generator 839 considers that no new fused data element is generated based on the result of the comparator 837, and therefore no address is assigned, that is, no new element is present in the fused data. At this time, the structure generator 839 also generates an operation structure element indicating data element accumulation, which accumulates the data element into the fused data element corresponding to the last fused index element, for example, { base _ addr, pt21}, where base _ addr indicates the address of the fused data element corresponding to the last fused index element, that is, the address of the first fused data element allocated in the last step, pt21 indicates the address of the current data element, and the addition result is still stored in base _ addr, that is, the data element is accumulated onto the first fused data element allocated above. The generated operation structure element is also stored in the second storage circuit 824.

When the sorting circuit 832 continues to output the 3 rd and 4 th "0", the fusion index will not increase, the fusion data element in the corresponding fusion data will not increase, and the structure generator will output the associated structure operation element: { base _ addr, pt31}, { base _ addr, pt41}.

Then, the sorting circuit 832 outputs the 1 st "2", referring to the previous step, when the buffer 835 outputs the index "2" as the 2 nd fused index element. The structure generator 839 assigns an address to the new fused data element based on the result of the comparator 837, for example adding an offset to the address of the first fused data element. The offset is the memory address size of the corresponding final data element (e.g., scalar, vector, or higher-dimensional tensor). At this time, an associated arithmetic structural element, for example, { base _ addr + offset, pt12} is generated for the new fused data element, and similarly, data in the address pointed to by base _ addr + offset is set to a value of 0 in advance. The generated operation structure element is stored in the second storage circuit 824.

When the 2 nd "2" is output by the sorting circuit 832, the fusion index will not increase, the fusion data element in the corresponding fusion data will not increase, and the structure generator will output the associated structure operation element: { base _ addr + offset, pt42}.

Other fusion results can be similarly derived by those skilled in the art according to the above description, and are not described one by one here.

Those skilled in the art will appreciate that other forms of hardware circuitry may be devised to implement the merge-sort fusion process described above, and the present disclosure is not limited in this respect.

In the disclosed embodiments, the merge sort fusion process of data may be implemented using the above-described exemplary hardware circuit by calling a fusion instruction. At this time, the operation object of the fusion instruction includes input K-way data to be fused, a K-way index corresponding to the K-way data, the size of the K-way data, and an output one-way fusion index and one-way fusion data, where K >1. In the objects, index elements in the K-way index indicate index information of corresponding data elements in the K-way data; the index elements of each path of index in the K paths of indexes are orderly arranged according to a first sequence; the data elements of each path of data in the K paths of data are orderly arranged according to the sequence of the corresponding indexes; the fusion index elements in the fusion index are orderly arranged according to a second sequence, wherein the same index elements are fused into the same fusion index element; the fused data elements in the fused data correspond to the fused index elements one to one, and the fused data elements composed of data elements having the same index are characterized by an operation structure element representing an accumulation operation of the corresponding data elements.

In some embodiments, the operand of the merge instruction may further include a total number of output merge index elements for indicating the number of index elements in an output one-way merge index. It can be understood that, since the fused data and the fused index have a one-to-one correspondence relationship, the operation object is also only the number of data elements in the output path of fused data.

In some embodiments, the operand of the fused instruction may also include the total number of elements of the output operation structure.

As mentioned previously, the first and second orders may be the same or different, and the first and second orders may be selected from any of: in order of small to large, or in order of large to small.

As mentioned before, in a fusion instruction involving tensor data, at least one operand comprises at least one descriptor to indicate shape information and/or spatial information of the tensor data.

FIG. 9 illustrates indications of various descriptors in a fused instruction according to some embodiments of the present disclosure.

As shown, the size offset includes an offset of the K ways input data and an offset of the K ways input index. The data in units of size _ offset preceding the size of the K input data, that is, the number of data in the offset amount of the K input data + the number of data in the offset amount of the K input index = size _ offset, and the ith number of access "size of K input data" is described, and the coordinates can be written as size _ offset + i. The fuse instruction may include an input first descriptor TID0 indicating an address offset of the K-way input data to be fused, an address offset of the K-way input index, and a size of the K-way input data. It is understood that each way of data is stored continuously, but each way of data may have different size, that is, different length, so that by identifying K ways of data by K address offsets, the base address of each way of data, for example, the start address of each way of data, can be determined quickly. Similarly, the K way indices are identified by K address offsets, thereby quickly determining a base address, e.g., a starting address, for each way index. The size of the K paths of input data comprises K elements, the ith element size [ i ] represents the number of data elements in the ith path of data, wherein i is more than 0 and less than or equal to K. Since the K-way data has a one-to-one correspondence with the K-way index, the size [ i ] also indicates the number of index elements in the ith-way index.

Alternatively or additionally, the fused instruction may further comprise an incoming second descriptor TID1 for indicating data of all incoming data. More specifically, TID1 may indicate a reference point address of K-way data, for example, a start address TID1.Base _ addr of the first-way data, whereby the reference point address and an address offset of the K-way data indicated by the first descriptor TID0 may be combined to access the K-way data. For example, the start address of the 2 nd way data can be determined by the start address tid1.Base _ addr + the offset data2_ offset of the 2 nd way data. Based on the starting address of the ways of data, the address of each of the data elements therein may be calculated. For example, the address of the ith data element in the first way data may be calculated as: base _ addr + i co input _ type, where co represents the length of a single data element, where the length of each data element is fixed, and input _ type represents the data type of the input data. Thus, the numerical information of each data element of each data of all K-way input data can be obtained in combination of TID1 and TID 0.

Alternatively or additionally, the fusion instruction may further comprise an input third descriptor TID2 for indicating data of all input indices. More specifically, TID2 may indicate a reference point address of the K-way index, e.g., the start address TID2.Base _ addr of the first way index, whereby the reference point address and an address offset of the K-way index indicated by the first descriptor TID0 may be combined to access the K-way index. For example, the start address of the first way index tid2.Base _ addr + offset index2_ offset of the 2 nd way index may determine the start address of the 2 nd way index. Based on the starting address of the way indices, the address of each of the index elements therein may be calculated. For example, the address of the ith index element in the first way index may be calculated as: base _ addr + i index _ type, where index _ type represents the data type of the input index. Thus, the numerical information of each index element of each way index of all K way input indexes can be obtained in combination of TID2 and TID 0.

Alternatively or additionally, the fused instruction may further comprise an input fourth descriptor TID4 for indicating a storage address of a final run result of the operation structure element. As described above, each fused data element after fusion is characterized by one or more operation structure elements, which represent in-situ addition operations, and the final result obtained after operation, i.e. after performing the addition operation, can be stored in the specified address, i.e. the address indicated by the fourth descriptor TID4. For example, TID4 may indicate the starting address TID4.Base _ addr of the final run result. In some embodiments, each final run result is a fixed length vector, e.g., co, that is the same length as the input data elements. Thus, the address of the fused data element resulting from the ith operation may be calculated as: base _ addr + i co output _ type, where output _ type represents the data type of the output data.

Alternatively or additionally, the fused instruction may further comprise an output fifth descriptor TID3 for indicating the fused index and operation structure elements. In some embodiments, the index elements in the input K-way index are ordered, for example, from small to large, and the last output one-way fused index element may also be ordered from small to large. In the merge-sort fusion process of the present disclosure, when there are duplicate indices, the indices are deduplicated.

Each fused data element may be associated with one or more operational structure elements. For example, when a fused data element is composed of a plurality of input data having the same index, a plurality of arithmetic structure elements are generated to add the input data one by one to the same fused data element. When the fused data element is composed of a single input data, only one operation structure element is generated, which represents an addition operation of the input data and 0-value data of the designated address.

Each operation structure element comprises two sub-elements, each sub-element is an address which is respectively an address of two addends, and one of the addresses is also used as an address of the result sum. In some embodiments, an arithmetic structure element may include two 64-bit elements to represent two addresses.

There may be a variety of operations related to data fusion, such as merge sort processing, sort accumulation, sort fusion processing, and the like. Various instruction schemes may be devised to implement the operations associated with data fusion.

In one arrangement, a fused instruction may be designed, and the fused instruction may include an operation mode bit to indicate different operation modes of the fused instruction, so as to perform different operations.

In another scheme, a plurality of fused instructions can be designed, and each instruction corresponds to one or more different operation modes, so that different operations can be executed. In one implementation, a corresponding blend instruction may be designed for each mode of operation. In another implementation, the operation modes can be classified according to their characteristics, and a blend instruction is designed for each operation mode. Further, when multiple operating modes are included in a certain class of operating modes, an operating mode bit may be included in the fused instruction to indicate the respective operating mode.

Regardless of the scheme, the fused instruction may indicate its corresponding mode of operation via the mode of operation bit and/or the instruction itself.

In the context of the present disclosure, the fused instruction may be a microinstruction or control signal that is executed within one or more multi-stage operation pipelines, which may include (or otherwise indicate) one or more operation operations to be performed by the multi-stage operation pipelines. Depending on different operational scenarios, the operational operations may include, but are not limited to, arithmetic operations such as convolution operations, matrix multiplication operations, logical operations such as and operations, xor operations, or operations, shift operations, or any combination thereof.

FIG. 10 illustrates an exemplary flow diagram of a data processing method 1000 in accordance with an embodiment of the disclosure.

As shown in fig. 10, in step 1010, a fusion instruction is parsed, the fusion instruction indicates that a fusion process is performed on multiple paths of data to be fused, and at least one operand of the fusion instruction includes at least one descriptor, the descriptor indicates at least one of the following information: shape information of the tensor data and spatial information of the tensor data. This step may be performed, for example, by control circuit 710 of fig. 7.

Next, in step 1020, the descriptor is parsed. This step may be performed, for example, by tensor interface circuit 712 of figure 7. Specifically, the data address of the tensor data corresponding to the operand in the data storage space may be determined according to the shape information of the tensor data; and/or determining dependencies between instructions based on spatial information of the tensor data.

Next, in step 1030, based on the parsed descriptor, merging the data elements in the multiple paths of data to be fused into an ordered path of fused data according to their corresponding indexes according to the fusion instruction, where each fused data element is characterized by using an operation structure element, where the data element may include any one of a scalar, a vector, or higher dimensional data.

Finally, in step 1040, the fused data is output.

Steps

1030 and 1040 may be performed, for example, by operational circuitry 730 of fig. 7.

It will be appreciated by a person skilled in the art that the individual steps of the above-described method correspond to the individual circuits described above in connection with the example circuit diagram, respectively, and therefore the features described above may equally be applied to the method steps and are not repeated here.

As can be seen from the above description, the embodiments of the present disclosure provide a fusion instruction for performing a fusion process of multiple paths of data to be fused. In some embodiments, the fusion instruction is a hardware instruction, and the data fusion processing is implemented by a special hardware circuit, so that the processing speed can be increased, and operations related to the thinned processing, such as operations in a radar algorithm, can be better supported. In some embodiments, the merge instruction may merge multiple paths of ordered data into one path of ordered merged data, and data with the same index may be merged and represented in the form of an operation structure, thereby facilitating subsequent calculation processing. In some embodiments, an operation mode bit may be included in the fused instruction to indicate that the fused instruction is a merge-sort fused processing operation, or the fused instruction itself may indicate a merge-sort fused processing operation. By providing a special fusing instruction to perform an operation related to the fusing processing of the multiplexed data, the processing can be simplified.

According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance instrument, a B ultrasonic instrument and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, the computationally-powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while the less-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.

It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, it will be appreciated by those skilled in the art in light of the disclosure or teachings of the present disclosure that certain steps therein may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure also focuses on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment can also be referred to in other embodiments.

In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of the connection relationships between the different units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.

In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned Memory unit or the Memory device may be any suitable Memory medium (including a magnetic Memory medium or a magneto-optical Memory medium, etc.), and may be, for example, a variable Resistance Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.

The foregoing detailed description of the disclosed embodiments has been presented to enable one of ordinary skill in the art to make and use the principles and implementations of the present disclosure; meanwhile, for the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the present disclosure should not be construed as limiting the present disclosure.

Claims

1. A data processing apparatus comprising:

a control circuit configured to parse a fuse instruction, the fuse instruction indicating to perform a fuse process on multiple paths of data to be fused, and at least one operand of the fuse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;

a tensor interface circuit configured to parse the descriptor;

a storage circuit configured to store information before and/or after the fusion process; and

and the operation circuit is configured to merge data elements in the multiple paths of data to be fused into one path of ordered fused data according to corresponding indexes of the data elements based on the analyzed descriptors and the fused instruction, wherein each fused data element is represented by using an operation structure element, and the data elements comprise any one of scalars, vectors or higher-dimensional data.

2. The data processing apparatus according to claim 1, wherein the operation object of the fusion instruction includes input K ways of data to be fused, K ways of indexes corresponding to the K ways of data, sizes of the K ways of data, and an output one way of fusion indexes and one way of fusion data, K >1, where:

index elements in the K-way index indicate index information of corresponding data elements in the K-way data;

the index elements of each path of index in the K paths of indexes are orderly arranged according to a first sequence;

the data elements of each path of data in the K paths of data are orderly arranged according to the sequence of the corresponding indexes;

fusion index elements in the fusion index are orderly arranged according to a second sequence, wherein the same index elements are fused into the same fusion index element; and is

Fused data elements in the fused data correspond to the fused index elements one to one, and fused data elements composed of data elements having the same index are characterized by an operation structure element representing an accumulation operation of the data elements.

3. The data processing apparatus according to claim 2, wherein the first order is the same or different from the second order, and the first and second orders are selected from any one of: in order of small to large, or in order of large to small.

4. A data processing apparatus as claimed in any one of claims 2 to 3, wherein the arithmetic circuitry comprises sorting circuitry and output circuitry, wherein:

the sorting circuit is configured to sort the K-way indexes according to the sizes of the index elements and output the K-way indexes to the output circuit in order; and is

The output circuitry is configured to generate, in response to receiving from the sorting circuitry an index element that is the same as a previous fused index element, an operation structure element representing an accumulation operation that accumulates a data element corresponding to the current index element to a fused data element corresponding to the previous fused index element, and to remove duplicate index elements.

5. The data processing apparatus of claim 4, wherein the output circuit is further configured to:

in response to receiving from the sorting circuitry an index element that is different from a previous fused index element, treating the current index element as a new fused index element, and generating an operation structure element representing adding a data element corresponding to the current index element to data of a specified address, wherein the data of the specified address is a 0 value.

6. A data processing apparatus as claimed in any one of claims 1 to 5, wherein each operation structure element indicates an in-situ addition operation comprising addresses pointing to two addends.

7. The data processing apparatus according to any one of claims 1 to 6,

the tensor interface circuit is configured to determine a data address of tensor data corresponding to the operand in a data storage space according to the shape information; and/or

The tensor interface circuit is configured to determine a dependency relationship between instructions according to the spatial information.

8. The data processing apparatus of any of claims 1-7, wherein the shape information of the tensor data comprises at least one shape parameter representing a shape of N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data comprising at least one of:

the size of a data storage space where the tensor data are located in at least one direction of N dimensional directions, the size of a storage area of the tensor data in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.

9. The data processing apparatus according to any of claims 2 to 8, wherein the merge instruction comprises a first descriptor input to indicate an address offset of the K ways of data, an address offset of the K ways of index, and a size of the K ways of data.

10. The data processing apparatus according to claim 9, wherein the merge instruction further comprises an input second descriptor indicating a reference point address of the K ways of data, wherein the reference point address and an address offset of the K ways of data indicated by the first descriptor are combined to access the K ways of data.

11. The data processing apparatus according to any of claims 9-10, wherein the merge instruction further comprises a third descriptor input to indicate a reference point address of the K-way index, wherein the reference point address and an address offset of the K-way index indicated by the first descriptor are combined to access the K-way index.

12. The data processing apparatus according to any of claims 2 to 11, wherein the fused instruction further comprises an input fourth descriptor for indicating a memory address of a final result of execution of the operation structure element.

13. The data processing apparatus according to any of claims 2 to 12, wherein the fused instruction comprises a fifth descriptor output for indicating the fused index and the operation structure element.

14. The data processing apparatus according to any one of claims 1 to 13, wherein the data elements in the data to be fused are thinned-out valid data elements in radar-based object detection, and the index indicates position information of the valid data elements in pre-thinned-out data.

15. A data processing apparatus as claimed in any of claims 1 to 14, wherein an operation mode bit is included in said fused instruction to indicate a fused processing operation of said fused instruction, or said fused instruction itself indicates said fused processing operation.

16. A chip comprising a data processing device according to any one of claims 1 to 15.

17. A board comprising the chip of claim 16.

18. A method of data processing, comprising:

parsing a fuse instruction, the fuse instruction indicating that a fuse process is performed on multiple paths of data to be fused, and at least one operand of the fuse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;

parsing the descriptor;

merging the data elements in the multi-path data to be merged into one path of ordered merged data according to the corresponding indexes of the data elements based on the analyzed descriptor and the merging instruction, wherein each merged data element is represented by using an operation structure element, and the data elements comprise any one of scalars, vectors or higher-dimensional data; and

and outputting the fusion data.

19. The data processing method according to claim 18, wherein the operation object of the fusion instruction includes input K ways of data to be fused, K ways of indexes corresponding to the K ways of data, sizes of the K ways of data, and an output one way of fusion indexes and one way of fusion data, K >1, where:

the fusion index elements in the fusion index are orderly arranged according to a second sequence, wherein the same index elements are fused into the same fusion index element; and is provided with

Fusion data elements in the fusion data correspond to the fusion index elements one to one, and the fusion data elements composed of data elements having the same index are characterized by operation structure elements of the accumulation operation of the data elements.

20. The data processing method of claim 19, further comprising:

the sorting circuit sorts the K indexes according to the sizes of the index elements and outputs the K indexes to the output circuit in order; and

in response to receiving the same index element as the previous fused index element from the sorting circuitry, the output circuitry generates an operation structure element representing an accumulation operation that accumulates the data element corresponding to the current index element to the fused data element corresponding to the previous fused index element, and removes duplicate index elements.

21. The data processing method of claim 20, further comprising:

in response to receiving an index element from the sorting circuitry that is different from a previous fused index element, outputting a current index element as a new fused index element; and

and generating an operation structure element for adding the data element corresponding to the current index element and the data of the specified address, wherein the data of the specified address is a 0 value.

22. A data processing method as claimed in any one of claims 18 to 21, wherein each operation structure element indicates an in-situ addition operation comprising addresses pointing to two addends.

23. The data processing method of any of claims 18-22, wherein parsing the descriptor comprises:

determining the data address of the tensor data in a data storage space according to the shape information; and/or

And determining the dependency relationship between the instructions according to the spatial information.

24. The data processing method of any of claims 18 to 23, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data including at least one of:

the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.

25. The data processing method of any of claims 19 to 24, wherein the fused instruction comprises an input first descriptor indicating an address offset of the K ways of data, an address offset of the K ways of index, and a size of the K ways of data.

26. The data processing method of claim 25, wherein the fused instruction further includes an input second descriptor indicating a reference point address of the K-way data, and parsing the descriptor further comprises:

combining the reference point address and the address offset of the K ways of data indicated by the first descriptor to obtain the address of the K ways of data.

27. The data processing method of any of claims 25-26, wherein the fused instruction further comprises an input third descriptor for indicating a fiducial address for the K-way index, and parsing the descriptor further comprises:

combining the reference point address and an address offset of the K-way index indicated by the first descriptor to obtain an address of the K-way index.

28. The data processing method according to any of claims 19 to 27, wherein said fused instruction further comprises an input fourth descriptor for indicating a memory address of a final result of execution of said operation structure element.

29. A data processing method as claimed in any of claims 19 to 28, wherein said fused instruction includes a fifth descriptor output for indicating said fused index and said operation structure element.