WO2022100345A1

WO2022100345A1 - Processing method, processing apparatus, and related product

Info

Publication number: WO2022100345A1
Application number: PCT/CN2021/123552
Authority: WO
Inventors: 刘少礼; 郝勇峥; 张英男; 王秉睿
Original assignee: 中科寒武纪科技股份有限公司
Priority date: 2020-11-13
Filing date: 2021-10-13
Publication date: 2022-05-19
Also published as: CN114489799A

Abstract

A processing method, a processing apparatus, and a related product. The processing apparatus can be implemented as a computing apparatus (510) included in a combined processing apparatus (500); the combined processing apparatus (500) further can comprise an interface apparatus (504) and other processing apparatuses (506); the computing apparatus (510) interacts with other processing apparatuses (506) so as to jointly complete a computing operation specified by a user; the combined processing apparatus (500) further can comprise a storage apparatus (508); and the storage apparatus (508) is separately connected to the computing apparatus (510) and other processing apparatuses (506) and used for storing the data of the computing apparatus (510) and other processing apparatuses (506). The method provides an instruction parallel solution that can improve instruction parallelism, thereby improving the processing efficiency of a machine.

Description

Processing method, processing device and related products

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the priority of the Chinese patent application filed on November 13, 2020, with the application number 2020112703785 and titled "Processing Method, Processing Device and Related Products".

technical field

The present disclosure relates to the field of processors, and in particular, to a processing method, a processing device, a chip and a board.

Background technique

The instruction system is the interface between computer software and hardware interaction, and is a very important part of the computer system structure. With the continuous development of artificial intelligence technology, the amount of data and data dimensions that need to be processed are constantly increasing. Therefore, how to control the execution of instructions reasonably and scientifically, especially to improve the degree of parallelism of instructions and improve the performance of the machine, is an important issue in the field of processors.

SUMMARY OF THE INVENTION

In order to solve one or more of the technical problems mentioned above, the present disclosure proposes solutions to enhance instruction parallelism in various aspects. Through the instruction system of the present disclosure, the degree of instruction parallelism can be improved, thereby improving the processing efficiency of the machine.

In a first aspect, the present disclosure provides a processing method, the method comprising: obtaining a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least a fine-grained region, the fine-grained region including one or more adjacent coordinate points of the shape coordinate space; determining whether there is an ongoing second operation on the tensor data; in the presence of the second operation When , determine whether the first fine-grained area currently targeted by the first operation overlaps with the second fine-grained area currently targeted by the second operation; When the fine-grained regions do not overlap, the first operation is performed.

In a second aspect, the present disclosure provides a processing device, comprising: an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data, the shape of the tensor data The coordinate space includes at least one fine-grained area, and the fine-grained area includes one or more adjacent coordinate points of the shape coordinate space; a first determination unit is configured to determine whether there is an ongoing process for the tensor data the second operation; the second determination unit is configured to, when the second operation exists, determine the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation whether the granular regions overlap; and an execution unit configured to perform the first operation when the first fine-grained region does not overlap with the second fine-grained region.

In a third aspect, the present disclosure provides a chip including the processing device of any embodiment of the foregoing second aspect.

In a fourth aspect, the present disclosure provides a board including the chip of any embodiment of the foregoing third aspect.

Through the processing device, processing method, chip, and board provided above, the embodiments of the present disclosure perform operations in parallel based on the fine-grained region of the shape coordinate space of the tensor data targeted by the operation during the execution of the operation of the instruction. limit, so that the parallel execution potential of the operation can be exploited. Therefore, according to the embodiments of the present disclosure, during the parallel execution of hardware, the consistency of the execution order can be ensured, and the degree of parallelism of operations can be improved, thereby ensuring the accuracy and efficiency of processing.

Description of drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals refer to like or corresponding parts wherein:

1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;

1B shows a schematic diagram of data partitioning in a data storage space according to an embodiment of the present disclosure;

FIG. 2 shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure;

3A-3C show schematic flowcharts of a processing method according to an embodiment of the present disclosure;

3D shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure;

4 shows a schematic diagram of a coordinate space range according to an embodiment of the present disclosure;

FIG. 5 shows a structural diagram of a combined processing apparatus according to an embodiment of the present disclosure; and

FIG. 6 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.

Detailed ways

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", "third" and "fourth" that may be used in the claims, description and drawings of the present disclosure are used to distinguish different objects, rather than to describe specific order. The terms "comprising" and "comprising" as used in the specification and claims of this disclosure indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude one or more other features, integers , step, operation, element, component and/or the presence or addition of a collection thereof.

It should also be understood that the terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this disclosure and the claims, the singular forms "a," "an," and "the" are intended to include the plural unless the context clearly dictates otherwise. It should further be understood that, as used in this disclosure and the claims, the term "and/or" refers to and including any and all possible combinations of one or more of the associated listed items.

As used in this specification and in the claims, the term "if" may be contextually interpreted as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined" or "in response to the determination" or "once the [described condition or event] is detected. ]" or "in response to detection of the [described condition or event]".

Computers process various data by executing instructions. In order to indicate the source of the data, the destination of the operation result and the operation performed, an instruction usually contains the following information:

(1) Operation Code (OP), which is used to indicate the operation to be completed by the instruction (for example, addition, subtraction, multiplication, division, data transfer, etc.), which specifies the nature and function of the operation. A computer may have dozens to hundreds of instructions, each instruction has a corresponding opcode, and the computer completes different operations by identifying the opcode.

(2) Operand, which is used to describe the operation object of the instruction. The operand can relate to the data type, memory access address, addressing mode, etc. of the object being operated. The operand can directly give the operated object, or point out the memory address or register address (ie register name) of the operated object.

The instructions of conventional processors are designed to perform basic single-data scalar operations. Here, a single-data scalar operation means that each operand of the instruction is a scalar data. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the oriented operands are often multi-dimensional vector (ie, tensor data) data types, and only using scalar operations cannot make hardware efficient Complete the operation task. Therefore, how to efficiently perform multi-dimensional tensor data processing is also an urgent problem to be solved in the current computing field.

In an embodiment of the present disclosure, an instruction system is provided in which a descriptor is included in an operand of the instruction, through which information related to tensor data can be obtained. Specifically, the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data. The shape information of the tensor data can be used to determine the data address in the data storage space of the tensor data corresponding to the operand. Spatial information of tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the execution order of instructions. The spatial information of tensor data can be indicated by a spatial identification (ID). A space ID can also be called a space alias, which refers to a space area used to store the corresponding tensor data. The space area can be a continuous space or multiple space. This disclosure does not have any specific composition of the space area. limit. Different spatial IDs indicate that there is no dependency between the pointed spatial regions. For example, it can be ensured that no dependencies exist by making the spatial regions pointed to by different spatial IDs not overlapped with each other.

Various possible implementations of the shape information of tensor data will be described in detail below with reference to the accompanying drawings.

Tensors can contain many forms of data composition. Tensors can be of different dimensions. For example, scalars can be regarded as 0-dimensional tensors, vectors can be regarded as 1-dimensional tensors, and matrices can be 2-dimensional or more than 2-dimensional tensors. The shape of a tensor includes information such as the dimension of the tensor and the dimensions of each dimension of the tensor. For example, for a three-dimensional tensor:

x ₃ = [[[1, 2, 3], [4, 5, 6]]; [[7, 8, 9], [10, 11, 12]]]

The shape or dimension of the tensor can be expressed as X ₃ =(2,2,3), that is, three parameters indicate that the tensor is a three-dimensional tensor, and the size of the first dimension of the tensor is 2, the third The dimension of the second dimension is 2 and the dimension of the third dimension is 3. When storing tensor data in the memory, the shape of the tensor data cannot be determined according to its data address (or storage area), and further related information such as the relationship between multiple tensor data cannot be determined, resulting in the processor to the tensor data. access efficiency is low.

In a possible implementation, the shape of the N-dimensional tensor data can be indicated by a descriptor, where N is a positive integer, such as N=1, 2, or 3, or zero. The three-dimensional tensor in the example above can be represented as (2, 2, 3) with descriptors. It should be noted that the present disclosure does not limit the manner in which the descriptor indicates the shape of the tensor.

In a possible implementation manner, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, and may also be set according to the usage needs of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (eg offset, size, etc.) of the three-dimensional tensor data in the three-dimensional direction. It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.

Although tensor data can be multi-dimensional, because the layout of memory is always one-dimensional, there is a correspondence between tensors and storage on memory. Tensor data is usually allocated in contiguous storage space, that is, the tensor data can be expanded one-dimensionally (eg, row-major manner) and stored on the memory.

This relationship between tensors and the underlying storage can be represented by the offset of the dimension (offset), the size of the dimension (size), the stride of the dimension (stride), and so on. The offset of a dimension refers to the offset relative to the reference position in that dimension. The size of a dimension refers to the size of the dimension, that is, the number of elements in the dimension. The step size of the dimension refers to the interval between adjacent elements in this dimension. For example, the step size of the three-dimensional tensor above is (6,3,1), that is, the step size of the first dimension is 6, and the second step size is 6. The step size of the dimension is 3, and the step size of the third dimension is 1.

FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1A , the data storage space 21 stores a two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the data storage space The starting address PA_start (reference address) of 21 is the physical address of the first data block 22 . The data block 23 is part of the data in the data storage space 21, the offset 25 in the X-axis direction is represented as offset_x, the offset 24 in the Y-axis direction is represented as offset_y, and the size in the X-axis direction is represented by is size_x, and the size in the Y-axis direction is represented by size_y.

In a possible implementation manner, when the descriptor is used to define the data block 23, the data reference point of the descriptor may use the first data block of the data storage space 21, and the reference address of the descriptor may be agreed as the data storage space 21 The starting address of PA_start. Then the size ori_x on the X axis, the size ori_y on the Y axis of the data storage space 21, and the offset amount offset_y in the Y axis direction, the offset amount offset_x in the X axis direction, and the offset amount in the X axis direction of the data block 23 can be combined. The content of the descriptor of the data block 23 is determined by the size size_x and the size size_y in the Y-axis direction.

In a possible implementation, the following formula (1) can be used to represent the content of the descriptor:

It should be understood that although in the above example, the content of the descriptor represents a two-dimensional space, those skilled in the art can set the specific dimension of the content of the descriptor according to the actual situation, which is not limited in the present disclosure.

In a possible implementation manner, the reference address of the data reference point of the descriptor in the data storage space may be agreed upon. The location of the data datum point, which determines the content of the descriptor of the tensor data.

For example, the base address PA_base of the data base point of the descriptor in the data storage space may be agreed. For example, one piece of data (for example, the data at the position (2, 2)) may be selected in the data storage space 21 as the data reference point, and the physical address of the data in the data storage space may be used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1A can be determined according to the positions of the two diagonal vertices relative to the data reference point. First, the positions of at least two vertices of the diagonal positions of the data block 23 relative to the data reference point are determined, for example, the positions of the diagonal position vertices relative to the data reference point in the upper left to lower right direction are used, wherein the upper left corner vertex is The relative position is (x_min, y_min), the relative position of the lower right vertex is (x_max, y_max), and then the relative position of the upper left vertex (x_min, y_min) and the relative position of the lower right vertex (x_max, y_max) determines the content of the descriptor of the data block 23.

In a possible implementation, the following formula (2) can be used to represent the content of the descriptor (the base address is PA_base):

It should be understood that, although the vertices in the upper left corner and the lower right corner are used to determine the content of the descriptor in the above example, those skilled in the art can set the specific vertices of the at least two vertices in the diagonal positions according to actual needs. , this disclosure does not limit this.

In a possible implementation manner, the tensor can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor. The content of the descriptor of the quantity data. Among them, the mapping relationship between the data description position and the data address can be set according to actual needs. For example, when the tensor data indicated by the descriptor is three-dimensional space data, the function f(x, y, z) can be used to define The data describes the mapping relationship between the location and the data address.

In a possible implementation, the following formula (3) can be used to represent the content of the descriptor:

In a possible implementation manner, the descriptor is further used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be is the following formula (4):

Where PA is the address parameter. The address parameter can be a logical address or a physical address. When parsing the descriptor, PA can be used as any one of the vertex, middle point or preset point of the vector shape, and the corresponding data address can be obtained by combining the shape parameters in the X direction and the Y direction.

In a possible implementation manner, the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data, and the reference address includes a start address of the data storage space.

In a possible implementation manner, the descriptor may further include at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (5):

Among them, PA_start is a reference address parameter, which is not repeated here.

It should be understood that those skilled in the art can set the mapping relationship between the data description location and the data address according to the actual situation, which is not limited in the present disclosure.

In a possible implementation manner, a predetermined reference address may be set in a task, the descriptors in the instructions under this task all use the reference address, and the content of the descriptor may include shape parameters based on the reference address. The base address can be determined by setting the environment parameters for this task. For the relevant description and usage of the reference address, reference may be made to the foregoing embodiments. In this implementation, the content of the descriptor can be mapped to the data address more quickly.

In a possible implementation manner, a reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the way of setting a common reference address by using environment parameters, each descriptor in this way can describe data more flexibly and use a larger data address space.

In a possible implementation manner, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. Among them, the calculation of the data address is automatically completed by the hardware, and when the representation of the content of the descriptor is different, the calculation method of the data address will also be different. This disclosure does not limit the specific calculation method of the data address.

For example, the content of the descriptor in the operand is represented by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y, then, the The starting data address PA1 _{(x, y)} of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):

PA1 _(x,y) =PA_start+(offset_y-1)*ori_x+offset_x(6)

According to the data starting address PA1 _(x,y ) determined by the above formula (6), combined with the offset offset_x and offset_y, and the size_x and size_y of the storage area, it can be determined that the tensor data indicated by the descriptor is stored in the data storage area. storage area in space.

In a possible implementation manner, when the operand further includes a data description location for the descriptor, the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor and the data description location. In this way, part of the data (eg, one or more data) in the tensor data indicated by the descriptor can be processed.

For example, the content of the descriptor in the operand is represented by formula (2). The offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y. The operand includes The data description position for the descriptor is (x _q , y _q ), then, the data address PA2 _{(x, y)} of the tensor data indicated by the descriptor in the data storage space can use the following formula (7) to make sure:

PA2 _(x,y) = PA_start+(offset_y+y _q -1)*ori_x+(offset_x+x _q ) (7)

In one possible implementation, the descriptor may indicate chunked data. Data block can effectively speed up the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data blocks for fast processing.

FIG. 1B shows a schematic diagram of a data block in a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1B , the data storage space 26 also stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure). Unlike the tensor data of FIG. 1A , the tensor data stored in FIG. 1B includes multiple data blocks.

In this case, the descriptor requires more parameters to represent these data chunks. Taking the X axis (X dimension) as an example, the following parameters can be involved: ori_x, x.tile.size (size 27 in the block), x.tile.stride (step size 28 in the block, that is, the first small distance between the first point of the block and the first point of the second tile), x.tile.num (the number of tiles, shown as 3 tiles in Figure 1B), x.stride (the overall stride length, that is, the distance from the first point of the first row to the first point of the second row) and so on. Other dimensions may similarly include corresponding parameters.

In one possible implementation, the descriptor may include the identifier of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be numbered; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, the tensor data is 3-dimensional data. Among the three dimensions of the tensor data, the shape parameters of two dimensions are fixed, and the content of the descriptor may include the shape representing the other dimension of the tensor data. parameter.

In a possible implementation manner, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to a descriptor one-to-one. In this case, the circuit or module responsible for parsing the computing instruction (eg, an entity external to the computing device of the present disclosure) can determine the data address in the data storage space of the data corresponding to the operand according to the descriptor.

In a possible implementation manner, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor can also be used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor can also be Include at least one address parameter representing the address of the tensor data. For example, if the tensor data is 3-dimensional data, when the descriptor points to the address of the tensor data, the content of the descriptor may include an address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data, It may also include multiple address parameters of the address of the tensor data, such as the start address + address offset of the tensor data, or address parameters of the tensor data based on each dimension. Those skilled in the art can set address parameters according to actual needs, which are not limited in the present disclosure.

In a possible implementation manner, the address parameter of the tensor data may include the reference address of the data reference point of the descriptor in the data storage space of the tensor data. Among them, the reference address can be different according to the change of the data reference point. This disclosure does not limit the selection of data benchmarks.

In one possible implementation, the reference address may include the start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the starting address of the data storage space. When the data reference point of the descriptor is other data than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.

In a possible implementation manner, the shape parameter of the tensor data includes at least one of the following: the size of the data storage space in at least one direction of N dimensions, the size of the storage area in at least one direction of N dimensions The size of the storage area, the offset of the storage area in at least one direction of the N dimension directions, the position of at least two vertices at the diagonal positions of the N dimension directions relative to the data reference point, the tensor indicated by the descriptor The mapping relationship between the data description location of the data and the data address. The data description position is the mapping position of the point or area in the tensor data indicated by the descriptor. For example, when the tensor data is 3-dimensional data, the descriptor can be represented by three-dimensional space coordinates (x, y, z). The shape of the tensor data, and the data description position of the tensor data may be the position of a point or area in the three-dimensional space that the tensor data is mapped to, represented by three-dimensional space coordinates (x, y, z).

It should be understood that those skilled in the art can select the shape parameters representing the tensor data according to the actual situation, which is not limited in the present disclosure. By using descriptors in the data access process, the association between data can be established, thereby reducing the complexity of data access and improving the efficiency of instruction processing.

FIG. 2 shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2 , the processing device 200 includes a control module 210 , an arithmetic module 220 and a storage module 230 .

The control module 210 can be configured to control the operation of the processing device 200, such as reading instructions from memory or externally, decoding (decoding) the instructions, and issuing micro-operation control signals to corresponding components. Specifically, the control module 210 may be configured to control the execution unit 220 to perform corresponding processing according to the received instruction. The instructions may include, but are not limited to, data access instructions, operation instructions, descriptor management instructions, and synchronization instructions. The present disclosure does not limit the specific type of instruction and the specific manner of decoding.

The decoded instruction includes the opcode and operands. When the instruction involves processing of tensor data, at least one operand of the instruction may include at least one descriptor indicating at least one of the following: shape information of the tensor data and spatial information of the tensor data.

The arithmetic module 220 is configured to execute specific instructions or operations under the control of the control module 210 . The operation module 220 may include, for example, but not limited to, an arithmetic logic unit (arithmetic and logic unit, ALU), a memory access unit (memory access unit, MAU), an artificial intelligence operation unit (neural functional unit, NFU), and the like. The present disclosure does not limit the specific hardware type of the execution unit.

The storage module 230 may be configured to store various information including, but not limited to, instructions, information associated with descriptors, tensor data, and the like. The storage module 230 may include various storage resources including, but not limited to, internal memory and external memory. Internal memory may include, for example, registers, on-chip SRAM, or other medium caches. External memory may include, for example, off-chip memory. The present disclosure does not limit the specific implementation of the memory module.

Alternatively or additionally, the processing apparatus 200 may further include a tensor interface unit (Tensor interface Unit, TIU) 240 . The tensor interface unit 240 may be configured to implement operations associated with the descriptor under the control of the control module 210 . These operations may include, but are not limited to, registration, modification, cancellation, and parsing of descriptors; reading and writing of content of descriptors. The present disclosure does not limit the specific hardware type of the tensor interface unit. In this way, operations associated with descriptors can be implemented through dedicated hardware, which further improves the access efficiency of tensor data.

In some embodiments of the present disclosure, tensor interface unit 240 may be configured to parse descriptors included in operands of instructions. For example, the tensor interface unit may parse the shape information of the tensor data included in the descriptor to determine the data address in the data storage space of the data corresponding to the operand.

Although the control module 210 and the tensor interface unit 240 are shown as two separate modules in FIG. 2 , those skilled in the art can understand that these two modules/units can also be implemented as one module or more modules. Disclosure is not limited in this regard.

The data processing apparatus 200 can be implemented by a general-purpose processor (such as a central processing unit CPU, a graphics processing unit GPU) and/or a special-purpose processor (such as an artificial intelligence processor, a scientific computing processor, or a digital signal processor, etc.). There is no restriction on the specific type of data processing device.

When the hardware executes instructions in parallel, if there is a dependency between the instructions executed in parallel, it may cause an error in the execution result. For example, if two instructions executed in parallel access the same storage unit, and at least one of the two instructions is an instruction to write to the storage unit, there is a dependency between the two instructions, such as read-after-write dependency, write-after-write dependency Write dependencies, or read-after-write dependencies. In this case, if the latter instruction is executed before the previous instruction, an execution error will result. Therefore, the order consistency of the execution of these instructions must be guaranteed, for example, by forced sequential execution, that is, the subsequent instruction must wait for the completion of the previous instruction to execute.

As can be seen from the previous description of tensor data, tensor data is usually a multi-dimensional array with a large amount of data, so the instruction processing time for tensor data is usually longer than that for scalar data. At this time, if the tensor data is still processed according to the previous order execution method, the processing time is too long and the efficiency is low. In view of this, in an embodiment of the present disclosure, an operation-level instruction parallelism scheme is provided, in which the parallelism of operations is restricted based on a fine-grained region of the shape coordinate space of the tensor data targeted by the operation, so that mining can be performed. The parallel execution potential of the operation. Therefore, according to the embodiments of the present disclosure, during the parallel execution of hardware, the consistency of the execution order can be ensured, and the degree of parallelism of operations can be improved, thereby ensuring the accuracy and efficiency of processing.

FIG. 3A shows an exemplary flowchart of a processing method 300 according to an embodiment of the present disclosure. The processing method 300 can be implemented, for example, by the processing apparatus 200 of FIG. 2 .

As shown in FIG. 3A , the method 300 starts at step S301 , a first operation of obtaining an instruction. This step may be performed, for example, by the control module 210 of FIG. 2 . In some embodiments, the first operation is an operation on tensor data whose shape coordinate space includes at least one fine-grained region. In some embodiments, the fine-grained region may include one or more adjacent coordinate points in the shape coordinate space of the tensor data. A fine-grained region is the smallest unit of operation.

It should be noted that the operations involved in the present disclosure may be basic operations supported by the processor hardware, or may be microinstructions (eg, request signals, etc.) after parsing the basic operations. The present disclosure does not limit the specific type of operation. The processing apparatus of the present disclosure may execute two operations in parallel, or may execute more than two operations in parallel, and the disclosure does not limit the number of operations executed in parallel. The two operations performed in parallel may belong to the same instruction or may belong to different instructions, and the present disclosure is not limited in this respect.

When hardware executes instructions in parallel, the processor can execute multiple operations in parallel. In order to avoid memory access conflicts, when multiple operations executed by the processor in parallel are for the same data, the processor will only execute one of the multiple operations. , while blocking other operations, thereby reducing the efficiency of the processor. In the embodiments of the present disclosure, the shape coordinate space of the processed tensor data is further divided into a plurality of fine-grained regions, and whether the operation can be executed in parallel is determined based on the fine-grained regions, thereby greatly improving the efficiency of the processor.

In some embodiments, the shape, size, and/or number of fine-grained regions may be determined based, at least in part, on at least one of: the computing power of the hardware; the bandwidth of the hardware; and the size of the shape coordinate space of the tensor data. The hardware computing capability may be the amount of data that the hardware processes in parallel in one computing cycle, and the hardware bandwidth may be the data transmission capability, such as the amount of data transmitted per unit time.

For example, the processor to which the processing method of the embodiment of the present disclosure is applied has the hardware computing capability of processing 100-bit data in parallel in one computing cycle, and the hardware bandwidth of transmitting 200-bit data per unit time. For two-dimensional tensor data, the shape coordinate space of the tensor data can be divided into 100 fine-grained regions according to the hardware computing capability, wherein each fine-grained region includes 100-bit data; the shape coordinate space can also be divided according to the hardware bandwidth is 50 fine-grained regions, wherein each fine-grained region includes 200-bit data.

It should be understood that the hardware computing capability and hardware bandwidth may vary according to different processor hardware, and the present disclosure does not limit the hardware computing capability and hardware bandwidth. In this way, the size and/or quantity of the fine-grained area can be determined according to the processing capability (hardware computing capability and/or hardware bandwidth) of the processor, so that the division result of the fine-grained area is more in line with the requirements of different hardware usage environments, The operations performed according to the fine-grained area tend to be synchronized with the processing capability of the processor, and the execution efficiency of the hardware can be exerted as much as possible, thereby improving the processing efficiency of the processor.

It should be noted that the shapes and sizes of the multiple fine-grained regions may be the same or different. For example, the first operation may carry the shape and size of the first fine-grained (position and number of coordinate points of each fine-grained area), and the first fine-grained may be set as a block of 8*8=64 coordinate points ( Assuming a two-dimensional tensor), and the second operation can carry the second fine-grained shape and size (such as the position and number of coordinate points of each fine-grained region), and the second fine-grained can be set to 16*16=256 A square of coordinates. That is, when the first operation is performed, the square of every 8*8=64 coordinate points is used as a fine-grained area, and when the second operation is performed, the square of every 16*16=256 coordinate points is used as a fine-grained area. . Likewise, the first operation may also carry a first fine-grained quantity (for example, set to 4), and the second operation may carry a second fine-grained quantity (for example, set to 8). That is, when the first operation is performed, the shape coordinate space is divided into 4 fine-grained regions, and when the second operation is performed, the shape coordinate space is divided into 8 fine-grained regions. It can be understood that the operation can also carry fine-grained parameters such as shape, size, and quantity at the same time. The shape, size and/or number of each fine-grained region can be determined according to requirements, which are not limited in the present disclosure.

Continuing with FIG. 3A, in step S302, it is determined whether there is an ongoing second operation on the tensor data.

As mentioned earlier, when the instruction involves the processing of tensor data, the operand includes a descriptor through which information related to the tensor data can be obtained. Therefore, in some embodiments, the descriptor may include spatial information of the tensor data (eg, a spatial identification ID), and the dependency between the instructions may be determined according to the spatial information of the tensor data. Since different spatial IDs indicate that there is no dependency between the spatial regions pointed to. Therefore, it can be quickly judged whether there is a dependency relationship between the two instructions according to whether the spatial IDs of the tensor data processed by the two instructions are the same, that is, whether they operate on the same tensor data.

In other embodiments, whether there is an ongoing second operation on the tensor data may be determined according to the occupancy state of the data storage area corresponding to the tensor data. For example, the processor may determine whether the data storage area of the tensor data is occupied by querying the occupancy status list, and if it is occupied, the judgment result is that there is an ongoing second operation on the tensor data. The occupancy state list may be preset and stored on the memory, or may be generated before the processor starts to execute a certain task, and is logged out after the task is completed. When the occupation status of each data storage area changes, the processor updates the content of the occupation status list to record the occupation status of each data storage area. The present disclosure does not limit the way of judging whether there is an ongoing second operation on one or more tensor data.

Next, in step S303, when there is such a second operation, it is determined whether there is overlap between the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation.

The first fine-grained region and the second fine-grained region may be any fine-grained regions among multiple fine-grained regions in the shape coordinate space of the tensor data. It can be understood that an operation on tensor data is an operation on each fine-grained region in the shape coordinate space of the tensor data. For example, assuming that tensor data A is a two-dimensional matrix with 8 rows and 16 columns, its shape coordinate space is a two-dimensional space, and every 2 rows and 4 columns is a fine-grained area, then the shape coordinate space of the tensor data includes 16 fine-grained regions. granular area. A write operation for tensor data A can be regarded as a write operation for the 16 fine-grained regions. The execution process can be: write the first fine-grained area (row 1-2, column 1-4), and write the second fine-grained area after the first fine-grained area (row 1-2, column 5) -8 columns), write the third fine-grained area (row 1-2, column 9-12) after the second fine-grained area is written, and so on until the 16th fine-grained area (the seventh -8 rows 13-16 columns), complete the write operation of tensor data A. Those skilled in the art can understand that operations can also be performed on multiple fine-grained regions at a time, for example, two or more fine-grained regions are written at a time until the operations on all regions are completed.

When there is an operation on tensor data, with the execution of the operation, the state of the fine-grained region in the shape coordinate space of the tensor data may include the completed state of the operation, the state of being operated, and the state of not being operated; or , for the case where it is not necessary to record whether the operation has been completed, the status can include the occupied status and the available status. The state of the fine-grained region currently targeted by the operation is the operating state or the occupied state. Therefore, when there is an operation on tensor data, it can be considered that there is an operation on a fine-grained region in the shape coordinate space of tensor data, and the fine-grained region that is being operated or is being occupied is the current operation. The fine-grained region targeted.

In a possible implementation manner, the first fine-grained area currently targeted by the first operation may include the fine-grained area targeted by the first operation to be executed. For example, when the operation is initially performed, it is specified to execute in a predetermined order. , usually the first fine-grained region. It may also include the fine-grained area currently targeted by the first operation being executed, which may be any fine-grained area. The second fine-grained area currently targeted by the second operation may be the fine-grained area currently targeted by the second operation being executed, and may be any fine-grained area.

In a possible implementation manner, when judging whether there is a second operation on the tensor data in progress before the first operation performs the operation on the tensor data, the first fine-grained first operation currently targeted by the first operation Region, which is the fine-grained region where the first operation will be performed. For example, before the first operation is performed on the tensor data, the first fine-grained region currently targeted by the first operation is usually the first fine-grained region in the shape coordinate space of the tensor data. At this time, the first operation has not performed an operation on the first fine-grained region. The second fine-grained area currently targeted by the ongoing second operation may be related to the execution process of the second operation. If the second operation has just started to be executed, the second fine-grained region may also be the first fine-grained region in the shape coordinate space of the tensor data. At this time, the first fine-grained area overlaps with the second fine-grained area. If the second operation has completed the operation of the first fine-grained area, and the second fine-grained area currently targeted is the P-th fine-grained area (P is an integer greater than 1), the first fine-grained area and the second fine-grained area The granular regions do not overlap.

In a possible implementation manner, when judging whether there is an ongoing second operation on the tensor data during the operation of the first operation on the tensor data, the first operation may be determined according to the execution process of the first operation. A fine-grained area, determining a second fine-grained area according to the execution process of the second operation, and then judging whether the first fine-grained area and the second fine-grained area overlap.

In a possible implementation manner, if the rhythm of the execution process of each operation is consistent, only before the first operation is performed on the tensor data, it can be judged whether there is a second operation on the tensor data in progress, and then judge whether there is an ongoing second operation on the tensor data. Whether the first fine-grained area overlaps with the second fine-grained area. Here, the beat consistency means that when the size of the fine-grained region is the same, the operation duration of two operations on a fine-grained region is the same.

In a possible implementation manner, if the beats of the execution processes of each operation are inconsistent or it cannot be determined whether they are consistent, during the operation of the first operation on tensor data, each time the operation of the first fine-grained region currently targeted is completed. Then, continue to judge whether there is an ongoing second operation on the tensor data, and continue to judge whether the first fine-grained region and the second fine-grained region overlap, so as to determine whether the first operation can be continued.

In a possible implementation manner, the first fine-grained area currently targeted by the first operation and the second fine-grained area currently targeted by the second operation may be determined according to the coordinate address, pointer position, fine-grained area identifier, etc. whether they overlap. For example, the coordinate address of the current tensor data of each operation can be recorded, and the first operation can be determined according to the current coordinate address of the first operation and the current coordinate address of the second operation, as well as the correspondence between the coordinate address and the fine-grained area. The first fine-grained area currently targeted and the second fine-grained area currently targeted by the second operation, and then it is determined whether the first fine-grained area and the second fine-grained area overlap. Both the coordinate address and the fine-grained region are defined based on the shape coordinate space of the tensor data. Therefore, after knowing the fine-grained division of the shape coordinate space, the corresponding fine-grained region can be directly determined from the coordinate address. For another example, a pointer may be set for each operation, and the pointer points to the fine-grained region currently targeted by the operation. According to the pointer position of the first operation and the pointer position of the second operation, respectively determine the first fine-grained area currently targeted by the first operation and the second fine-grained area currently targeted by the second operation, and then determine the first fine-grained area Whether to overlap with the second fine-grained region. For another example, an identifier may also be set for each fine-grained area, and whether the first fine-grained area and the second fine-grained area overlap is determined by recording the identifier of the fine-grained area currently targeted by the operation. Identification can include any combination of letters, numbers or symbols. It is also possible to judge whether the first fine-grained area and the second fine-grained area overlap in other ways, and the present disclosure does not limit the basis for judging whether the first fine-grained area and the second fine-grained area overlap.

Next, in step S304, when there is no overlap between the first fine-grained region and the second fine-grained region, a first operation is performed.

In a possible implementation manner, if the first fine-grained area currently targeted by the first operation does not overlap with the second fine-grained area currently targeted by the second operation, it may be that the first fine-grained area is the first fine-grained area that the second operation has already targeted The fine-grained area where the operation is completed may also be a fine-grained area where the second operation does not need to be performed. At this time, executing the first operation will not affect the operation process and operation result of the second operation, and the first operation can be performed.

According to this embodiment, when the shape coordinate space of the tensor data targeted by the first operation includes at least one fine-grained region, and there is an ongoing second operation on the tensor data, it is possible to determine the first operation currently targeted by the first operation. Whether there is overlap between a fine-grained region and the second fine-grained region currently targeted by the second operation, and when there is no overlap between the two, the first operation is performed. In this way, the fine-grained regions of the current operation of the first operation and the second operation can be executed without overlapping, so that the first operation and the second operation can operate on the same tensor data at the same time, which improves the processing efficiency of the processor.

In a possible implementation manner, the processing method 300 may further include: blocking the first operation when the first fine-grained region overlaps with the second fine-grained region.

In a possible implementation manner, the first fine-grained region and the second fine-grained region overlap, including the first fine-grained region and the second fine-grained region completely or partially overlapping. When the first fine-grained area overlaps with the second fine-grained area, if the first operation is performed, the operation of the first operation on the overlapping partial area may affect the execution of the second operation, resulting in an inaccurate operation result of the second operation. The execution of the first operation may be affected, resulting in an inaccurate operation result of the first operation. At this time, the first operation may be blocked, that is, the execution of the first operation may be suspended, and the first operation may be executed after the second operation completes the operation on the second fine-grained region currently targeted. That is, when the first fine-grained area does not overlap with the second fine-grained area, the first operation is performed.

In this embodiment, when the first fine-grained area overlaps with the second fine-grained area, blocking the first operation can avoid operation errors and inaccurate operation results caused by the overlapping of the fine-grained areas of each operation, ensuring that The correctness of each operation.

In some embodiments, at least one of the first operation and the second operation may be a write operation. That is, when the operation on the target data is read after write (the second operation is a write operation, and the first operation is a read operation), read after write (the second operation is a read operation, and the first operation is a write operation), or write after write (both the second operation and the first operation are write operations), there will be a dependency relationship between the two operations, in which case the method in the embodiment of the present disclosure can be used. In these embodiments, by dividing the shape coordinate space of the target data into one or more fine-grained regions, and performing operations in units of fine-grained regions, operations such as read-after-write, read-after-write, and write-after-write can be achieved. It can be executed correctly to obtain an accurate execution result, and the waiting time between operations can be reduced, and the execution efficiency of the processor can be improved.

In an embodiment of the present disclosure, based on the fine-grained region division of the shape coordinate space of tensor data, a processing method for determining the execution range of an operation based on the coordinate space range expressed by the fine-grained region is also provided.

FIG. 3B schematically shows an exemplary flowchart of a processing method according to an embodiment of the present disclosure. Likewise, the processing method of FIG. 3B can be implemented by, for example, the processing apparatus 200 of FIG. 2 .

As shown in FIG. 3B, in step S311, a first coordinate space range of the number of tensors allowed to be used by the first operation is determined. This step may be performed, for example, by the control module 210 of FIG. 2 . The first coordinate space range may be, for example, a portion of the shape coordinate space of the tensor data involved in the first operation.

Next, in step S312, the second coordinate space range of the tensor data to be used when the first operation is performed is determined. This step may be performed, for example, by the execution unit 220 of FIG. 2 . The second coordinate space range may be, for example, a portion of the shape coordinate space of the tensor data involved in the first operation.

Finally, in step S313, the first operation is performed within the third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. This step may be performed, for example, by the execution unit 220 of FIG. 2 .

The first coordinate space range, the second coordinate space range, and the third coordinate space range can all be expressed based on fine-grained regions in the shape coordinate space of the tensor data, that is, the first, The second and third coordinate space extents.

In the embodiment of the present disclosure, by restricting the coordinate space range of tensor data that can be used when the operation is executed, for example, by restricting the operation to be executed within the third coordinate space range, it can be ensured that the instruction is executed in each instruction during parallel execution. Accesses over the extent of the coordinate space are all sequential, thereby ensuring accurate and efficient processing. Further, since the programming on the software side usually uses spatial coordinates to refer to data points or data blocks in the tensor data, the parallel execution of operations is constrained by the coordinate space range of the tensor data, which can simplify the code programming on the software side. It is more conducive to the execution of instructions.

In some embodiments, the overlap determination of the fine-grained regions described above in conjunction with FIG. 3A and FIG. 3B may be performed only under certain conditions, thereby shortening the determination time and speeding up instruction processing.

FIG. 3C schematically shows an exemplary flowchart of a processing method according to another embodiment of the present disclosure.

As shown in FIG. 3C, in step S321, the first operation of the instruction is obtained. In some embodiments, the first operation is an operation on tensor data, and its operand may include a descriptor of the tensor data.

Next, in step S322, it is determined whether there is an ongoing second operation on the tensor data. This operation is similar to step S302 described above in conjunction with FIG. 3A , and will not be repeated here.

If it is determined that there is no such second operation, the method may directly jump to step S326, that is, execute the first operation. This means that there is no second operation that could conflict with the first operation, so the first operation can be executed immediately. When there are other operations being performed, the first operation is performed in parallel with these other operations at this time.

If it is determined that there is such a second operation, that is, a conflict may occur, the method may proceed to step S323, where it is further determined whether the data operation ranges of the first operation and the second operation overlap. It can be understood that since tensor data usually has a large dimension, the data operation range for different operations may be different. When the data operation ranges of different operations do not overlap with each other, the first operation can be performed in parallel with the preceding second operation without conflict.

There are various ways to determine whether the data manipulation ranges of an operation overlap. In some embodiments, whether the data operation ranges overlap may be determined based on spatial information and/or shape information of the tensor data to be operated on. For the spatial information and shape information of tensor data, please refer to the previous detailed description, which will not be repeated here. The shape information of the tensor data can be used to determine the access address of the operation, so as to determine whether there is overlap between the data operation ranges of the two operations. The access address may be a coordinate space address of tensor data or a storage space address of tensor data, which is not limited in this aspect of the present disclosure.

If it is determined in step S323 that the data operation ranges of the first operation and the second operation do not overlap, the method may jump to step S326, that is, the first operation is performed. This means that even if the first operation and the second operation access the same tensor data (determined in step S322), as long as the data operation ranges of the first operation and the second operation do not overlap, that is, those accessing the same tensor data respectively do not overlap each other. part, the first operation can be performed in parallel with the second operation.

If it is determined in step S323 that the data operation ranges of the first operation and the second operation overlap, the method may proceed to step S324, where it is further determined whether the fine-grained regions currently targeted by the first operation and the second operation overlap. For a specific determination method, reference may be made to the foregoing description in conjunction with FIG. 3A and FIG. 3B .

When it is determined in step S324 that the data operation ranges of the first operation and the second operation do not overlap, the method may proceed to step S326, that is, the first operation is performed. Therefore, based on the dynamic execution of the operation, it is possible to dynamically determine whether the currently targeted fine-grained regions overlap, so as to realize the parallel execution of the operation at the level of the fine-grained region, and maximize the parallel potential of the operation.

If it is determined in step S324 that the data operation ranges of the two operations overlap, the first operation cannot be performed at this time, otherwise a conflict will be caused. Therefore, in step S325, the first operation is blocked.

In the embodiment of FIG. 3C , by first performing static pre-judgment based on the data operation range of the operation, only under certain conditions (that is, when the data operation range overlaps), the dynamic fine-grained area overlap judgment is performed, which can effectively Shorten the judgment time and speed up the command processing.

The present disclosure also provides exemplary processing apparatuses for implementing the processing methods of FIGS. 3A, 3B, and 3C. FIG. 3D shows a schematic functional block diagram of a processing apparatus according to an embodiment of the present disclosure.

As shown in FIG. 3D , the processing apparatus 30 includes an operation acquisition unit 31 , a first determination unit 32 , a second determination unit 33 and an execution unit 34 .

The operation acquisition unit 31 is configured to acquire the first operation of the instruction. The first operation is an operation on tensor data. The shape coordinate space of the tensor data includes at least one fine-grained region, and each fine-grained region includes one or more adjacent coordinate points in the shape coordinate space. The first determination unit 32 is configured to determine whether there is an ongoing second operation on the tensor data. The second determining unit 33 is configured to, when there is such a second operation, determine whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation. The executing unit 34 is configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.

In some embodiments, the second determination unit 33 may include a first determination subunit 331 and a second determination subunit 332 . The first determination subunit 331 is configured to determine a first coordinate space range of the tensor data that is allowed to be used by the first operation. The second determination subunit 332 is configured to determine the second coordinate space range of the tensor data to be used when the first operation is performed. In these embodiments, the execution unit 34 may be configured to execute the first operation within the third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. In these embodiments, the first coordinate space extent, the second coordinate space extent, and the third coordinate space extent are characterized using fine-grained regions in the shape coordinate space of the tensor data.

In some embodiments, the processing device 30 may further include a blocking unit 35 and a third determining unit 36 . The blocking unit 35 may be configured to block the first operation to prevent a conflict from occurring when it is determined that the first operation overlaps with the fine-grained region currently targeted by the second operation.

In some embodiments, the third determination unit 36 may be configured to perform a static judgment in advance, that is, to determine whether the data operation ranges of the first operation and the second operation overlap. The judgment of the second determination unit 33 is performed only when the data operation ranges overlap. The execution unit 34 may execute the first operation according to the judgment result of each determination unit.

Those skilled in the art can understand that each unit shown in FIG. 3D is divided according to function implementation. This division is only exemplary, in actual implementation, two or more functions may be implemented in the same hardware unit, and one function may also be implemented in two hardware units distributed. For example, in one implementation, the operation obtaining unit 31, the first determining unit 32 and the optional third determining unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2, while the second determining unit 33 and The execution unit 34 may be included in the arithmetic module 220 of the processing device 200 . For another example, in another implementation, the operation acquisition unit 31 , the first determination unit 32 , the second determination unit 33 and the optional third determination unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2 , and the execution unit 34 is included in the operation module 220 of the processing device 200 .

It should also be understood that the units included in the processing device 30 correspond to the various steps in the method described with reference to Figures 3A, 3B and 3C. Therefore, the operations and features described above with respect to the method are also applicable to the processing device 30 and the units included therein, and details are not repeated here.

FIG. 4 schematically illustrates the division of coordinate space ranges according to an embodiment of the present disclosure. FIG. 4 uses two-dimensional data as an example for illustrative illustration, but those skilled in the art can understand that the same solution can be similarly applied to three-dimensional or even more dimensional tensor data.

As shown in FIG. 4 , the shape coordinate space 400 of the two-dimensional tensor data is divided into 12 fine-grained regions, 4001, 4002, . . . , 4011 and 4012, respectively. On each fine-grained region, access to it is guaranteed to be sequential. Any data element (eg, data point) on the tensor data can be represented by two-dimensional spatial coordinates (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward). Obviously, the coordinates of any data element on the tensor data will not exceed the maximum size of the shape's coordinate space.

In some embodiments, all fine-grained regions in the shape coordinate space of the tensor data that are not currently used by prior operations associated with the first operation may be determined as the first coordinate space range.

In these embodiments, for example, when the fine-

grained regions

4004 , 4008 , 4009 - 4012 are used by the previous operation, the spatial range (ie, the first coordinate spatial range) that is allowed to be used by the first operation at this time may include the fine-grained region 4001 -4003 as well as fine-grained regions 4005-4007, as shown with diagonal shading.

Alternatively or additionally, in some embodiments, a fine-grained region determined based on the coordinates of the tensor data to be accessed by the first operation is determined as the second coordinate space range.

In these embodiments, for example, when the first operation is expected to use fine-grained regions other than fine-grained region 4001 and fine-grained region 4002 (eg, estimated from the coordinates of the tensor data to be accessed), the execution at this time The spatial extent to be used by the first operation (ie, the second coordinate spatial extent) may be determined as fine-grained regions 4003-4012, as indicated by dot filling.

Then, according to the embodiment of the present disclosure, the operable range when the first operation is actually performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. As shown in FIG. 4 , in the current example, the third coordinate space range is an area where both oblique line shadows and dot filling exist, that is, the fine-grained areas 4003 and 4005-4007 in FIG. 4 .

In some embodiments, the first coordinate space range, the second coordinate space range, and the third coordinate space range can be directly characterized by the identification of the fine-grained region included in each of them. For example, in the example shown in FIG. 4, the first coordinate space range can be characterized by the identification of fine-grained regions 4001-4003 and fine-grained regions 4005-4007; The third coordinate space range can be characterized using the identification of fine-grained regions 4003 and 4005-4007.

Considering that in most cases, the access to tensor data is usually based on a certain dimension, the access coordinates are gradually increased, and the data units on each coordinate point in the tensor data are traversed from front to back.

Thus, in other embodiments, the first coordinate space extent is characterized using a coordinate upper bound on one or more dimensions of the tensor data that allows the fine-grained region used by the first operation; and/or the second coordinate space extent Characterized using coordinate lower bounds in one or more dimensions of the tensor data for the fine-grained region that the first operation is expected to use. By utilizing the feature of accessing tensor data in order of dimensions, only the upper coordinate bound or the lower coordinate bound can be used to characterize the first coordinate space range or the second coordinate space range, thereby simplifying control information and corresponding control methods.

Still taking FIG. 4 as an example, as previously described, the first coordinate space extent may be characterized using a fine-grained region of upper bounds on coordinates on one or more dimensions of the tensor data that are allowed to be used by the first operation. For example, when the previous operation used the rightmost column and the lowermost row, a total of 6 fine-grained regions, the space range allowed to be used by the first operation (that is, the first coordinate space range) at this time may include the upper left two rows and three columns , a total of 6 fine-grained regions, as indicated by the diagonally shaded regions. At this time, in FIG. 4 , the first coordinate space range can be characterized by the X upper bound 411 on the X axis and the fine-grained region where the Y upper bound 421 on the Y axis is located. In this example, the first coordinate space extent can be characterized using fine-grained regions 4003 and 4005, which indicate that the data coordinates accessed by the first operation cannot exceed the fine-grained region 4003 in the X dimension and cannot exceed the fine-grained region in the Y dimension Area 4005.

Similarly, the second coordinate space extent may be characterized using a fine-grained region of lower bounds on coordinates on one or more dimensions of the tensor data that the first operation is expected to use. For example, when it is determined that the first operation will use fine-grained regions other than the two fine-grained regions on the left of the first row based on the coordinates of the tensor data to be accessed by the first operation, it can be determined that the second coordinate space range includes the remaining 10 a fine-grained region, as shown in the dot-filled section. At this time, in FIG. 4 , the second coordinate space range can be characterized by the X lower bound 412 on the X axis and the fine-grained region where the Y lower bound 422 on the Y axis is located. In this example, the second coordinate space range can be characterized by fine-

grained regions

4002 and 4001, which indicate that when the first operation is performed, the X dimension in the tensor data is lower than the fine-grained region 4002, and the Y dimension is lower than that in the tensor data. Data in the fine-grained region 4001.

The operable range when the first operation is actually performed is the third coordinate space range, which is the intersection of the first coordinate space range and the second coordinate space range. As shown in FIG. 4 , in the current example, the third coordinate space range is an area where both oblique line shadows and dot filling exist, that is, the “inverse L-shaped” area in FIG. 4 .

The first coordinate space extent and the second coordinate space extent can be determined in various ways.

In some embodiments, for example, at least one of the following may be additionally considered to determine the first coordinate space range: the sequence of operations; the operands involved in the operations; the second coordinate space range of the previous operation. For example, in an embodiment in which an upper bound and a lower bound of the coordinate space are used to represent the coordinate space range, the lower bound of the coordinate space of the tensor data used by the previous operation or instruction can be used as the upper bound of the tensor data used by the current new instruction. boundary.

In an example, when the first operation (ie, the current operation) is a read operation, the upper bound of the coordinate space is the lower bound of the coordinate space of the nearest (ie, the previous operation or the previous operation) for the write operation on the tensor data .

In another example, when the first operation is a write operation, the upper bound of the coordinate space is the lower bound of the coordinate space of the most recent write operation on the tensor data and all the tensor data between the two write operations. The minimum value in the lower bound of the coordinate space of the read operation. By selecting the minimum value, it can be ensured that the execution of the first operation will not affect the execution of any preceding operation.

Alternatively or additionally, the second coordinate space range may be determined based on at least one of: an execution range of the operation; an access mode of the operation; and a current execution state of the operation. For example, in the embodiment in which the upper and lower bounds of the coordinate space are used to represent the coordinate space range, the above factors can be comprehensively considered to determine the second coordinate space range, so as to ensure that when accessing tensor data according to the dimension, the coordinates on the corresponding dimension are not less than the lower bound of the coordinate space. Further, try to provide the maximum value of the lower bound of the coordinate space, so that the accessible space range for subsequent operations or instructions will be larger.

In one example, when the access mode of the first operation is sequential and sequential access, the coordinate space lower bound may be determined based on the minimum access coordinate of the first operation. For example, the coordinate space lower bound may be determined as the fine-grained region in which the smallest access coordinate is located. As shown in Figure 4, when the first operation accesses data according to the X dimension, assuming that the minimum X coordinate of the accessed data is A, which is located in the second fine-grained area from the left, the X lower bound can be determined as the second fine-grained area. Granularity area; when the first operation accesses data according to the Y dimension, assuming that the minimum Y coordinate of the accessed data is B and falls in the third fine-grained area from the top, the Y lower bound can be determined as the third fine-grained area .

In another example, when the access mode of the first operation is regular access, the coordinate space lower bound may be determined based on the regularity. For example, in the convolution operation, it may be necessary to access the tensor data in blocks, so the lower bound of the coordinate space can be determined according to the block law of the convolution operation.

In yet another example, when the access mode of the first operation cannot be determined, the coordinate space lower bound may be determined based on a predetermined setting. For example, the coordinate space lower bound may be a default value such as 0 or the size of 1 or more fine-grained regions.

In some embodiments, the first and second coordinate space extents may be determined based on a pre-division of the shape coordinate space of the tensor data. Specifically, the shape coordinate space of the tensor data may be firstly divided into several spatial blocks, for example, divided uniformly or non-uniformly in each dimension, and each spatial block includes one or more fine-grained regions. For example, still referring to FIG. 4, the shape coordinate space of tensor data is divided into 6 spatial blocks, each spatial block A, B, C, D, E and F including 2 fine-grained regions.

In these embodiments, a block of space in the shape coordinate space of the tensor data, where the previous operation has been completed, may be determined as the first coordinate space extent; and will be determined based on the coordinates of the tensor data to be accessed by the first operation The space block is determined as the second coordinate space range.

For example, when the previous operation has completed the access to the space blocks A and B and is using the space block C, the space range (that is, the first coordinate space range) that is allowed to be used by the first operation at this time may include the space block A and the space block C. block B. Further, for example, when it is expected that the first operation will use the spatial block A and the spatial block B (for example, estimated according to the coordinates of the tensor data to be accessed), the spatial range (that is, the first operation will be used) is performed at this time. The two-coordinate space range) can be determined as space block A and space block B.

Then, according to the embodiment of the present disclosure, the operable range when the first operation is actually performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. In the current example, the third coordinate space extents are space blocks A and B.

Alternatively or additionally, in some embodiments, within the third coordinate space, the first operation may be performed based on at least one order of: a predetermined spatial block order; and/or a predetermined fine-grained region order.

In some implementations, after the shape coordinate space of the tensor data to be operated is divided into blocks in advance, such as the six space blocks in FIG. 4 , a space block sequence can be predetermined, that is, the operation on each space block in the coordinate space The order, for example, in the order of spatial blocks A, B, C, D, E and F. In this case, if the operands or usage spaces of the two instructions with dependencies are the entire tensor data, the instructions can operate on the space blocks one by one in that order. For example, assuming that the preceding instruction 1 is to write tensor data, and the following instruction 2 is to read the tensor data, then instruction 1 can first perform a write operation on space block A, and then perform a write operation on space block B. At this point, the instruction 2 can start to read the space block A. If the division of space blocks makes the execution rhythm of instruction 2 and instruction 1 consistent, then in the subsequent time, when instruction 1 starts to write space block C, instruction 2 has also completed the read operation of space block A, and starts A read operation on space block B; and so on. It can be seen that the division of space blocks is conducive to the parallel execution of instructions, and the convention of the order of space blocks is conducive to simplifying operation scheduling, shortening processing time, and improving processing efficiency.

Alternatively or additionally, in some implementations, when the first operation is performed within a single spatial block, it may also be performed in a predetermined fine-grained region order. When the operation scope of the parallel-executed instructions is further controlled based on the fine-grained region of the current operation within a single space block, this sequential execution of the predetermined fine-grained region is beneficial to simplify the operation scheduling, the principle of which is similar to that described above The principle of executing in the order of predetermined space blocks will not be repeated here.

In yet other embodiments, the first and second coordinate space extents may also be determined by combining dynamic determination with pre-partitioning of the tensor data shape coordinate space. Specifically, the shape coordinate space of the tensor data may be firstly divided into several space blocks, for example, divided uniformly or non-uniformly in each dimension. Next, within each spatial block, the first and second coordinate spatial extents may be determined dynamically based on the execution of the operation. For a specific determination manner, reference may be made to the foregoing description, which is not repeated here. In these implementations, when the exact position of the second coordinate space range in a certain space block cannot be determined, the range corresponding to the space block can be defaulted.

In some embodiments, the pre-division of the shape coordinate space of the tensor data may be performed based on at least one of the following: the processing capability of the hardware; preset parameters; and the size of the shape coordinate space of the tensor data. The processing capability of the hardware may include, but is not limited to, the data bit width that the hardware can process. Based on the data bit width that the hardware can process, the shape coordinate space of the tensor data is divided, which can give full play to the processing capability of the hardware and improve the efficiency of parallel processing. For example, the preset parameters can directly specify the number of space blocks to be divided, the size of each dimension of the space block, and so on. The shape coordinate space of the tensor data is segmented based on the size/dimension of the shape coordinate space of the tensor data. For example, when the tensor data is a two-dimensional matrix, its size is M rows*N columns (M and N are positive integers), then each row can be divided into m parts on average, and each column can be divided into n parts on average, so that A total of m*n space blocks.

Although the six space blocks divided equally are shown in FIG. 4 , they can also be divided into various numbers of space blocks with unequal sizes, and the present disclosure does not limit the specific division manner. The above describes the scheme of constraining the space range actually used by the operation in order to ensure the sequential consistency of data processing and improve the efficiency of parallel processing when the hardware performs operations in parallel. Those skilled in the art can understand that the current operation (for example, the aforementioned first operation) and the previous operation (or the previous operation) may be operations in different instructions executed in parallel; and the current operation and the previous operation may also be respectively The present disclosure is not limited in this regard to different operations performed in parallel within the same instruction.

The processing method performed by the processing apparatus of the embodiment of the present disclosure has been described above with reference to the flowchart. Those skilled in the art can understand that since operations performed in parallel are constrained based on the coordinate space range of the processed data, the degree of parallelism of operations can be improved while ensuring sequential consistency of operations, thereby improving processing efficiency. It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described action sequence. As in accordance with the present disclosure, certain steps may be performed in other orders or concurrently. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

It should be further noted that although the steps in the method flow chart are displayed in sequence according to the arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order, and these steps may be performed in other orders. Moreover, at least a part of the steps in the method flow chart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed and completed at the same time, but may be executed at different times. The order of execution is also not necessarily sequential, but may be performed alternately or alternately with other steps or sub-steps of other steps or at least a portion of a phase.

FIG. 5 is a structural diagram illustrating a combined processing apparatus 500 according to an embodiment of the present disclosure. As shown in FIG. 5 , the combined processing device 500 includes a computing processing device 502 , an interface device 504 , other processing devices 506 and a storage device 508 . According to different application scenarios, the computing and processing apparatus may include one or more computing apparatuses 510, and the computing apparatus may be configured as the processing apparatus 200 shown in FIG. 2 for performing the operations described herein in conjunction with FIG. 4 .

In various embodiments, the computing processing devices of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as an artificial intelligence processor core or a part of the hardware structure of an artificial intelligence processor core, for the computing processing device of the present disclosure, it can be regarded as having a single-core structure or a homogeneous multi-core structure.

In an exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user. According to different implementations, other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors. One or more types of processors. These processors may include, but are not limited to, Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable Logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs. As mentioned above, only for the computing processing device of the present disclosure, it can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.

In one or more embodiments, the other processing device may serve as an interface for the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) with external data and control, performing operations including but not limited to Limited to basic controls such as data movement, starting and/or stopping computing devices. In other embodiments, other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.

In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device. Further, the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip. Alternatively or alternatively, the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.

Additionally or alternatively, the combined processing device of the present disclosure may also include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively. In one or more embodiments, a storage device may be used to store data of the computing processing device and/or the other processing device. For example, the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.

In some embodiments, the present disclosure also discloses a chip (eg, chip 602 shown in FIG. 6). In one implementation, the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 5 . The chip can be connected with other related components through an external interface device (such as the external interface device 606 shown in FIG. 6 ). The relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. In some application scenarios, other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above-mentioned chip. In some embodiments, the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 6 .

FIG. 6 is a schematic structural diagram illustrating a board 600 according to an embodiment of the present disclosure. As shown in FIG. 6 , the board includes a storage device 604 for storing data, which includes one or more storage units 610 . The storage device can be connected to the control device 608 and the chip 602 described above for connection and data transmission through, for example, a bus. Further, the board also includes an external interface device 606, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 612 (such as a server or a computer, etc.). For example, the data to be processed can be transmitted to the chip by an external device through an external interface device. For another example, the calculation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.

In one or more embodiments, the control device in the board of the present disclosure may be configured to regulate the state of the chip. To this end, in an application scenario, the control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.

According to the above description in conjunction with FIG. 5 and FIG. 6 , those skilled in the art can understand that the present disclosure also discloses an electronic device or device, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.

According to different application scenarios, the electronic devices or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile Terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment. The vehicles include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph. The electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal. In one or more embodiments, the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that the hardware resources of the cloud device can be retrieved from the hardware information of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device. Match the appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.

It should be noted that, for the purpose of simplicity, the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.

In terms of specific implementation, based on the disclosure and teaching of this disclosure, those skilled in the art can understand that several embodiments disclosed in this disclosure can also be implemented in other ways not disclosed herein. For example, as for each unit in the foregoing electronic device or apparatus embodiment, this article divides them on the basis of considering logical functions, and there may also be other division methods in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As far as the connection relationship between different units or components is concerned, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

In this disclosure, units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or elements may be co-located or distributed over multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.

In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a mobile hard disk, a magnetic disk, or a CD, etc. that can store programs. medium of code.

In other implementation scenarios, the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In this regard, the various types of devices described herein (eg, computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (High Bandwidth Memory) , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.

The foregoing can be better understood in accordance with the following terms:

Clause 1. A method of processing, the method comprising:

The first operation of the acquisition instruction, the first operation is an operation on tensor data, and the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or one of the shape coordinate space. Multiple adjacent coordinate points;

determining whether there is an ongoing second operation on the tensor data;

When the second operation exists, determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and

The first operation is performed when the first fine-grained region does not overlap with the second fine-grained region.

Clause 2. The method of clause 1, wherein the method further comprises:

The first operation is blocked when the first fine-grained region overlaps the second fine-grained region.

Clause 3. The method of any of clauses 1-2, wherein the method further comprises:

determining whether the data operation range of the first operation overlaps the data operation range of the second operation;

When the data operation ranges overlap, performing the determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and

The first operation is performed when the data operation ranges do not overlap.

Clause 4. The method of Clause 3, wherein determining whether the data manipulation scope of the first operation overlaps the data manipulation scope of the second operation is determined based on at least one of the following:

Spatial information of the tensor data to be operated on; and/or

The shape information of the tensor data to be operated on.

Clause 5. The method of any of clauses 1-4, wherein the method further comprises:

determining a first coordinate space range of the tensor data that is allowed to be used by the first operation;

determining a second coordinate space extent of the tensor data to be used when performing the first operation; and

Perform the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range;

The first coordinate space range, the second coordinate space range and the third coordinate space range are characterized by the fine-grained region.

Clause 6. The method of clause 5, wherein the first coordinate space extent is determined based on at least one of:

the sequence of operations;

the number of operands involved in the operation;

the second coordinate space extent of the prior operation; and

A predetermined division of the shape coordinate space of the tensor data.

Clause 7. The method of any of clauses 5-6, wherein the second coordinate space extent is determined based on at least one of the following:

the scope of execution of the operation;

the access mode of the operation;

the current execution status of the operation; and

A predetermined division of the shape coordinate space of the tensor data.

Clause 8. The method of any of clauses 5-7, wherein:

Determining the first coordinate space extent includes: determining an upper bound on the coordinate space of one or more dimensions of the tensor data that is allowed to be used by the first operation; and/or

Determining the second coordinate space extent includes determining a coordinate space lower bound for one or more dimensions of the tensor data expected to be used by the first operation.

Clause 9. The method of any of clauses 1-8, wherein the size and/or number of the fine-grained regions is determined based, at least in part, on at least one of the following:

the computing power of the hardware;

the bandwidth of the hardware; and

The size of the shape coordinate space of the tensor data.

Clause 10. The method of any of clauses 1-9, wherein at least one of the first operation and the second operation is a write operation.

Clause 11. The method of any of clauses 1-10, wherein:

The first operation and the second operation are respectively operations in different instructions executed in parallel; or

The first operation and the second operation are respectively different operations performed in parallel in the same instruction.

Clause 12. A processing device comprising:

An operation obtaining unit configured to obtain a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least one fine-grained area, and the fine-grained area includes all one or more adjacent coordinate points in the shape coordinate space;

a first determining unit, configured to determine whether there is an ongoing second operation on the tensor data;

A second determining unit, configured to determine whether the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation overlap when the second operation exists ;as well as

An execution unit configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.

Clause 13. The processing apparatus of clause 12, wherein the processing apparatus further comprises:

A blocking unit, configured to block the first operation when the first fine-grained region and the second fine-grained region overlap.

Clause 14. The processing apparatus of any of clauses 12-13, wherein the processing apparatus further comprises:

a third determination unit configured to determine whether the data operation range of the first operation overlaps the data operation range of the second operation; and

The second determining unit is configured to, when the third determining unit determines that the data operation ranges overlap, perform the determining of the first fine-grained area currently targeted by the first operation and the current second operation. whether there is overlap in the second fine-grained region targeted; and

The execution unit is configured to execute the first operation when the third determination unit determines that the data operation ranges do not overlap.

Clause 15. The processing device of Clause 14, wherein the third determination unit determines whether the data manipulation range of the first operation overlaps the data manipulation range of the second operation based on at least one of the following:

Spatial information of the tensor data to be operated on; and/or

The shape information of the tensor data to be operated on.

Clause 16. The processing device of clauses 12-15, wherein the second determining unit further comprises:

a first determination subunit, configured to determine a first coordinate space range of the tensor data that is allowed to be used by the first operation;

a second determination subunit, configured to determine a second coordinate space range of the tensor data to be used when performing the first operation; and

The execution unit is further configured to execute the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range,

Clause 17. The processing device of Clause 16, wherein the first coordinate space extent is determined based on at least one of: a sequence of operations;

the number of operands involved in the operation;

the second coordinate space extent of the prior operation; and

A predetermined division of the shape coordinate space of the tensor data.

Clause 18. The processing device of any of clauses 16-17, wherein the second coordinate space extent is determined based on at least one of:

the scope of execution of the operation;

the access mode of the operation;

the current execution status of the operation; and

A predetermined division of the shape coordinate space of the tensor data.

Clause 19. The processing device of any of clauses 16-18, wherein:

The first determination subunit is further configured to: determine an upper bound on the coordinate space of one or more dimensions of the tensor data that is allowed to be used by the first operation; and/or

The second determination subunit is further configured to: determine a coordinate space lower bound of one or more dimensions of the tensor data that is expected to be used by the first operation.

Clause 20. The processing device of any of clauses 12-19, wherein the size and/or number of the fine-grained regions is determined based, at least in part, on at least one of the following:

the computing power of the hardware;

the bandwidth of the hardware; and

The size of the shape coordinate space of the tensor data.

Clause 21. The processing device of any of clauses 12-20, wherein at least one of the first operation and the second operation is a write operation.

Clause 22. The processing device of any of clauses 12-21, wherein:

Clause 23. A chip comprising the processing device of any of clauses 12-22.

Clause 24. A board comprising the chip of clause 23.

Claims

A processing method, the method includes:

The first operation of the acquisition instruction, the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or one of the shape coordinate space. Multiple adjacent coordinate points;

determining whether there is an ongoing second operation on the tensor data;

When the second operation exists, determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and

The first operation is performed when the first fine-grained region does not overlap with the second fine-grained region.
The method of claim 1, wherein the method further comprises:

The first operation is blocked when the first fine-grained region overlaps the second fine-grained region.
The method according to any one of claims 1-2, wherein the method further comprises:

determining whether the data operation range of the first operation overlaps the data operation range of the second operation;

When the data operation ranges overlap, performing the determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and

The first operation is performed when the data operation ranges do not overlap.
3. The method of claim 3, wherein determining whether the data manipulation range of the first operation overlaps the data manipulation range of the second operation is determined based on at least one of the following:

Spatial information of the tensor data to be operated on; and/or

The shape information of the tensor data to be operated on.
The method according to any one of claims 1-4, wherein the method further comprises:

determining a first coordinate space range of the tensor data that is allowed to be used by the first operation;

determining a second coordinate space extent of the tensor data to be used when performing the first operation; and

Perform the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range;

The first coordinate space range, the second coordinate space range and the third coordinate space range are characterized by the fine-grained region.
6. The method of claim 5, wherein the first coordinate space extent is determined based on at least one of the following:

the sequence of operations;

the number of operands involved in the operation;

the second coordinate space extent of the prior operation; and

A predetermined division of the shape coordinate space of the tensor data.
The method of any one of claims 5-6, wherein the second coordinate space extent is determined based on at least one of the following:

the scope of execution of the operation;

the access mode of the operation;

the current execution status of the operation; and

A predetermined division of the shape coordinate space of the tensor data.
The method according to any one of claims 5-7, wherein:

Determining the first coordinate space extent includes: determining an upper bound on the coordinate space of one or more dimensions of the tensor data that is allowed to be used by the first operation; and/or

Determining the second coordinate space extent includes determining a coordinate space lower bound for one or more dimensions of the tensor data expected to be used by the first operation.
8. The method of any one of claims 1-8, wherein the size and/or number of the fine-grained regions is determined based at least in part on at least one of the following:

the computing power of the hardware;

the bandwidth of the hardware; and

The size of the shape coordinate space of the tensor data.
The method of any one of claims 1-9, wherein at least one of the first operation and the second operation is a write operation.
The method according to any one of claims 1-10, wherein:

The first operation and the second operation are respectively operations in different instructions executed in parallel; or

The first operation and the second operation are respectively different operations performed in parallel in the same instruction.
A processing device, comprising:

An operation obtaining unit configured to obtain a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least one fine-grained area, and the fine-grained area includes all one or more adjacent coordinate points in the shape coordinate space;

a first determining unit, configured to determine whether there is an ongoing second operation on the tensor data;

A second determining unit, configured to determine whether the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation overlap when the second operation exists ;as well as

An execution unit configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.
The processing device of claim 12, wherein the processing device further comprises:

A blocking unit, configured to block the first operation when the first fine-grained region and the second fine-grained region overlap.
The processing device according to any one of claims 12-13, wherein the processing device further comprises:

a third determination unit configured to determine whether the data operation range of the first operation overlaps the data operation range of the second operation; and

The second determining unit is configured to, when the third determining unit determines that the data operation ranges overlap, perform the determining of the first fine-grained area currently targeted by the first operation and the current second operation. Whether there is overlap in the second fine-grained region targeted; and

The execution unit is configured to execute the first operation when the third determination unit determines that the data operation ranges do not overlap.
The processing apparatus according to claim 14, wherein the third determination unit determines whether the data operation range of the first operation overlaps the data operation range of the second operation based on at least one of the following:

Spatial information of the tensor data to be operated on; and/or

The shape information of the tensor data to be operated on.
The processing device according to claims 12-15, wherein the second determination unit further comprises:

a first determination subunit, configured to determine a first coordinate space range of the tensor data that is allowed to be used by the first operation;

a second determination subunit, configured to determine a second coordinate space range of the tensor data to be used when performing the first operation; and

The execution unit is further configured to execute the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range,

The first coordinate space range, the second coordinate space range and the third coordinate space range are characterized by the fine-grained region.
The processing device of claim 16, wherein the first coordinate space range is determined based on at least one of the following:

the sequence of operations;

the number of operands involved in the operation;

the second coordinate space extent of the prior operation; and

A predetermined division of the shape coordinate space of the tensor data.
The processing device according to any one of claims 16-17, wherein the second coordinate space range is determined based on at least one of the following:

the scope of execution of the operation;

the access mode of the operation;

the current execution status of the operation; and

A predetermined division of the shape coordinate space of the tensor data.
The processing device according to any one of claims 16-18, wherein:

The first determination subunit is further configured to: determine an upper bound on the coordinate space of one or more dimensions of the tensor data that is allowed to be used by the first operation; and/or

The second determination subunit is further configured to: determine a coordinate space lower bound of one or more dimensions of the tensor data expected to be used by the first operation.
The processing device of any of claims 12-19, wherein the size and/or number of the fine-grained regions is determined at least in part based on at least one of the following:

the computing power of the hardware;

the bandwidth of the hardware; and

The size of the shape coordinate space of the tensor data.
The processing apparatus according to any one of claims 12-20, wherein at least one of the first operation and the second operation is a write operation.
The processing device according to any one of claims 12-21, wherein:

The first operation and the second operation are respectively operations in different instructions executed in parallel; or

The first operation and the second operation are respectively different operations performed in parallel in the same instruction.
A chip, characterized in that the chip includes the processing device according to any one of claims 12-22.
A board, characterized in that the board comprises the chip of claim 23 .