WO2022100345A1 - Processing method, processing apparatus, and related product - Google Patents

Processing method, processing apparatus, and related product Download PDF

Info

Publication number
WO2022100345A1
Authority
WO
WIPO (PCT)
Prior art keywords
fine
data
coordinate space
tensor data
grained
Prior art date
Application number
PCT/CN2021/123552
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
郝勇峥
张英男
王秉睿
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司 filed Critical 中科寒武纪科技股份有限公司
Publication of WO2022100345A1 publication Critical patent/WO2022100345A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817Specially adapted for signal processing, e.g. Harvard architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead

Definitions

  • the present disclosure relates to the field of processors, and in particular, to a processing method, a processing device, a chip and a board.
  • the instruction system is the interface for interaction between computer software and hardware, and is a very important part of the computer architecture.
  • the present disclosure proposes solutions to enhance instruction parallelism in various aspects.
  • the degree of instruction parallelism can be improved, thereby improving the processing efficiency of the machine.
  • the present disclosure provides a processing method, the method comprising: obtaining a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data including at least one fine-grained region, the fine-grained region including one or more adjacent coordinate points of the shape coordinate space; determining whether there is an ongoing second operation on the tensor data; when the second operation exists, determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and when the first fine-grained region does not overlap with the second fine-grained region, performing the first operation.
  • the present disclosure provides a processing device, comprising: an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data, the shape coordinate space of the tensor data including at least one fine-grained region, the fine-grained region including one or more adjacent coordinate points of the shape coordinate space;
  • a first determination unit configured to determine whether there is an ongoing second operation on the tensor data;
  • a second determination unit configured to determine, when the second operation exists, whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation;
  • an execution unit configured to perform the first operation when the first fine-grained region does not overlap with the second fine-grained region.
  • the present disclosure provides a chip including the processing device of any embodiment of the foregoing second aspect.
  • the present disclosure provides a board including the chip of any embodiment of the foregoing third aspect.
  • the embodiments of the present disclosure restrict the parallelism of operations based on the fine-grained regions of the shape coordinate space of the tensor data targeted by the operations during instruction execution, so that the parallel execution potential of the operations can be exploited. Therefore, according to the embodiments of the present disclosure, the consistency of the execution order can be ensured during parallel execution by the hardware, and the degree of parallelism of operations can be improved, thereby ensuring the accuracy and efficiency of processing.
  • FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure
  • FIG. 1B shows a schematic diagram of data partitioning in a data storage space according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure
  • 3A-3C show schematic flowcharts of a processing method according to an embodiment of the present disclosure
  • 3D shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a coordinate space range according to an embodiment of the present disclosure
  • FIG. 5 shows a structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 6 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined”, “in response to the determination”, “once the [described condition or event] is detected”, or “in response to detection of the [described condition or event]”.
  • In order to indicate the source of the data, the destination of the operation result, and the operation to be performed, an instruction usually contains the following information:
  • Operation code (opcode), which is used to indicate the operation to be completed by the instruction (for example, addition, subtraction, multiplication, division, data transfer, etc.) and specifies the nature and function of the operation.
  • a computer may have dozens to hundreds of instructions, each instruction has a corresponding opcode, and the computer completes different operations by identifying the opcode.
  • Operand, which is used to describe the operation object of the instruction.
  • the operand can relate to the data type, memory access address, addressing mode, etc. of the object being operated.
  • the operand can directly give the operated object, or point out the memory address or register address (ie register name) of the operated object.
  • the instructions of conventional processors are designed to perform basic single-data scalar operations.
  • a single-data scalar operation means that each operand of the instruction is a scalar data.
  • the operands involved are often multi-dimensional vector (i.e., tensor data) data types, and scalar operations alone cannot enable the hardware to complete the operation task efficiently. Therefore, how to efficiently process multi-dimensional tensor data is also an urgent problem to be solved in the current computing field.
  • an instruction system in which a descriptor is included in an operand of the instruction, through which information related to tensor data can be obtained.
  • the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data.
  • the shape information of the tensor data can be used to determine the data address in the data storage space of the tensor data corresponding to the operand.
  • Spatial information of tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the execution order of instructions.
  • the spatial information of tensor data can be indicated by a spatial identification (ID).
  • a space ID can also be called a space alias, which refers to a space area used to store the corresponding tensor data.
  • the space area can be one continuous space or multiple discontinuous spaces. The present disclosure does not limit the specific composition of the space area.
  • Different spatial IDs indicate that there is no dependency between the pointed spatial regions. For example, it can be ensured that no dependencies exist by making the spatial regions pointed to by different spatial IDs not overlapped with each other.
  • Tensors can contain many forms of data composition. Tensors can be of different dimensions. For example, scalars can be regarded as 0-dimensional tensors, vectors can be regarded as 1-dimensional tensors, and matrices can be 2-dimensional or more than 2-dimensional tensors.
  • the shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, consider a three-dimensional tensor whose three dimensions have sizes 2, 2, and 3 respectively:
  • the three-dimensional tensor in the example above can be represented as (2, 2, 3) with descriptors. It should be noted that the present disclosure does not limit the manner in which the descriptor indicates the shape of the tensor.
  • the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, and may also be set according to the usage needs of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (eg offset, size, etc.) of the three-dimensional tensor data in the three-dimensional direction. It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.
  • Although tensor data can be multi-dimensional, the layout of memory is always one-dimensional, so there is a correspondence between tensors and their storage in memory.
  • Tensor data is usually allocated in contiguous storage space, that is, the tensor data can be expanded one-dimensionally (e.g., in a row-major manner) and stored in memory.
  • This relationship between tensors and the underlying storage can be represented by the offset of the dimension (offset), the size of the dimension (size), the stride of the dimension (stride), and so on.
  • the offset of a dimension refers to the offset relative to the reference position in that dimension.
  • the size of a dimension refers to the size of the dimension, that is, the number of elements in the dimension.
  • the stride of a dimension refers to the interval between adjacent elements in that dimension. For example, the stride of the three-dimensional tensor above is (6, 3, 1), that is, the stride of the first dimension is 6, the stride of the second dimension is 3, and the stride of the third dimension is 1.
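  • As a minimal sketch of the relationship above (assuming a row-major layout; the helper names below are chosen for illustration and do not appear in the disclosure), the strides can be derived from the per-dimension sizes, and a multi-dimensional index can then be flattened into a one-dimensional storage offset:

```python
def row_major_strides(sizes):
    # For sizes (2, 2, 3) this returns (6, 3, 1): each stride is the product
    # of the sizes of all faster-varying (trailing) dimensions.
    strides = []
    acc = 1
    for size in reversed(sizes):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))

def flatten_index(index, strides):
    # Maps a multi-dimensional index to its offset in the one-dimensional storage.
    return sum(i * s for i, s in zip(index, strides))

assert row_major_strides((2, 2, 3)) == (6, 3, 1)
assert flatten_index((1, 0, 2), (6, 3, 1)) == 8
```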
  • FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure.
  • the data storage space 21 stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward).
  • the size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure)
  • the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure)
  • the starting address PA_start (reference address) of the data storage space 21 is the physical address of the first data block 22.
  • the data block 23 is part of the data in the data storage space 21; its offset 25 in the X-axis direction is represented as offset_x, its offset 24 in the Y-axis direction is represented as offset_y, its size in the X-axis direction is represented as size_x, and its size in the Y-axis direction is represented as size_y.
  • when the data block 23 is defined with a descriptor, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined by combining the size ori_x of the data storage space 21 on the X axis, its size ori_y on the Y axis, the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
  • the content of the descriptor represents a two-dimensional space
  • those skilled in the art can set the specific dimension of the content of the descriptor according to the actual situation, which is not limited in the present disclosure.
  • the reference address of the data reference point of the descriptor in the data storage space may be agreed upon.
  • the base address PA_base of the data base point of the descriptor in the data storage space may be agreed.
  • one piece of data in the tensor data (for example, the data at position (2, 2)) may be selected as the data reference point, and the physical address of this data in the data storage space may be used as the reference address PA_base.
  • the content of the descriptor of the data block 23 in FIG. 1A can be determined according to the positions of the two diagonal vertices relative to the data reference point.
  • the positions of at least two vertices at diagonal positions of the data block 23 relative to the data reference point are determined. For example, using the diagonal vertices in the upper-left to lower-right direction, the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max); the content of the descriptor of the data block 23 can then be determined from the relative position (x_min, y_min) of the upper-left vertex and the relative position (x_max, y_max) of the lower-right vertex.
  • the following formula (2) can be used to represent the content of the descriptor (the base address is PA_base):
  • the content of the descriptor of the tensor data can also be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor. The mapping relationship between the data description position and the data address can be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional space data, a function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.
  • the descriptor is further used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be the following formula (4):
  • PA is the address parameter.
  • the address parameter can be a logical address or a physical address.
  • PA can be any one of the points of the tensor shape, such as a vertex, a middle point, or a preset point, and the corresponding data address can be obtained by combining it with the shape parameters in the X direction and the Y direction.
  • the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data, and the reference address includes a start address of the data storage space.
  • the descriptor may further include at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (5):
  • PA_start is a reference address parameter, which is not repeated here.
  • The mapping relationship between the data description position and the data address can be set according to the actual situation, which is not limited in the present disclosure.
  • a predetermined reference address may be set in a task, the descriptors in the instructions under this task all use the reference address, and the content of the descriptor may include shape parameters based on the reference address.
  • the base address can be determined by setting the environment parameters for this task. For the relevant description and usage of the reference address, reference may be made to the foregoing embodiments.
  • the content of the descriptor can be mapped to the data address more quickly.
  • a reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the way of setting a common reference address by using environment parameters, each descriptor in this way can describe data more flexibly and use a larger data address space.
  • the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor.
  • the calculation of the data address is automatically completed by the hardware, and when the representation of the content of the descriptor is different, the calculation method of the data address will also be different. This disclosure does not limit the specific calculation method of the data address.
  • For example, if the content of the descriptor in the operand is represented by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y, then the starting data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):
  • PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (6)
  • the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor and the data description location. In this way, part of the data (eg, one or more data) in the tensor data indicated by the descriptor can be processed.
  • For example, if the content of the descriptor in the operand is represented by formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x*size_y, and the operand includes a data description position (x_q, y_q) for the descriptor, then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (7):
  • PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (7)
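  • A minimal sketch of formulas (6) and (7) for a row-major two-dimensional storage space (the parameter names follow the text, the 1-based convention for offset_y in the formulas is kept, and the function names are chosen for illustration):

```python
def start_address(PA_start, ori_x, offset_x, offset_y):
    # Formula (6): starting data address of the tensor data indicated by the
    # descriptor, with all offsets and sizes expressed in elements.
    return PA_start + (offset_y - 1) * ori_x + offset_x

def element_address(PA_start, ori_x, offset_x, offset_y, x_q, y_q):
    # Formula (7): address of the element at data description position
    # (x_q, y_q) inside the region indicated by the descriptor.
    return PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

# Example: a block at offset (offset_x=2, offset_y=3) inside a storage space
# whose rows are ori_x=16 elements wide (all numbers are illustrative).
base = start_address(PA_start=0x1000, ori_x=16, offset_x=2, offset_y=3)
addr = element_address(0x1000, 16, 2, 3, x_q=1, y_q=0)
assert addr == base + 1
```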
  • the descriptor may indicate chunked data.
  • Data blocking can effectively speed up operations and improve processing efficiency in many applications. For example, in graphics processing, convolution operations often use data blocks for fast processing.
  • FIG. 1B shows a schematic diagram of a data block in a data storage space according to an embodiment of the present disclosure.
  • the data storage space 26 also stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward).
  • the size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure).
  • the tensor data stored in FIG. 1B includes multiple data blocks.
  • the descriptor requires more parameters to represent these data chunks.
  • In the X dimension, the following parameters can be involved: ori_x, x.tile.size (the tile size 27), x.tile.stride (the tile stride 28, that is, the distance between the first point of the first tile and the first point of the second tile), x.tile.num (the number of tiles, shown as 3 tiles in FIG. 1B), x.stride (the overall stride, that is, the distance from the first point of the first row to the first point of the second row), and so on.
  • Other dimensions may similarly include corresponding parameters.
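  • A hedged sketch of how such block parameters could be combined into an address along one dimension (the dotted parameter names above are mapped to ordinary identifiers, and the arithmetic is an illustration rather than the disclosure's formula):

```python
def x_offset(tile_index, pos_in_tile, x_tile_stride):
    # Offset along X of the element at position pos_in_tile inside tile number
    # tile_index (both 0-based), where x_tile_stride is the distance between
    # the first points of adjacent tiles.
    return tile_index * x_tile_stride + pos_in_tile

def row_start(PA_start, row, x_stride):
    # Address of the first element of a row, where x_stride is the overall
    # stride (distance from the first point of one row to that of the next).
    return PA_start + row * x_stride

# Example: 3 tiles along X with tile size 4 and tile stride 6 (illustrative).
addr = row_start(0x2000, row=2, x_stride=32) + x_offset(tile_index=1, pos_in_tile=3, x_tile_stride=6)
```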
  • the descriptor may include the identifier of the descriptor and/or the content of the descriptor.
  • the identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be numbered; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data.
  • For example, the tensor data is 3-dimensional data; among the three dimensions of the tensor data, the shape parameters of two dimensions are fixed, and the content of the descriptor may include a shape parameter representing the remaining dimension of the tensor data.
  • the data address of the data storage space corresponding to each descriptor may be a fixed address.
  • a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to a descriptor one-to-one.
  • In this case, the circuit or module responsible for parsing the computing instruction (e.g., an entity external to the computing device of the present disclosure) can determine the data address of the tensor data in the data storage space according to the descriptor.
  • when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor can also be used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor may also include at least one address parameter representing the address of the tensor data.
  • the content of the descriptor may include one address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data; it may also include multiple address parameters of the address of the tensor data, such as the start address + address offset of the tensor data, or address parameters of the tensor data based on each dimension.
  • the address parameter of the tensor data may include the reference address of the data reference point of the descriptor in the data storage space of the tensor data.
  • the reference address can vary with the choice of the data reference point. The present disclosure does not limit the selection of the data reference point.
  • the reference address may include the start address of the data storage space.
  • the reference address of the descriptor is the starting address of the data storage space.
  • the reference address of the descriptor is the address of the data block in the data storage space.
  • the shape parameter of the tensor data includes at least one of the following: the size of the data storage space in at least one of the N dimension directions, the size of the storage area in at least one of the N dimension directions, the offset of the storage area in at least one of the N dimension directions, the positions of at least two vertices at diagonal positions in the N dimension directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the data description position is the mapping position of the point or area in the tensor data indicated by the descriptor.
  • For example, when the tensor data is 3-dimensional data, three-dimensional space coordinates (x, y, z) can be used to represent the shape of the tensor data, and the data description position of the tensor data may be the position of a point or an area in the three-dimensional space to which the tensor data is mapped, represented by three-dimensional space coordinates (x, y, z).
  • FIG. 2 shows a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure.
  • the processing device 200 includes a control module 210 , an arithmetic module 220 and a storage module 230 .
  • the control module 210 can be configured to control the operation of the processing device 200, such as reading instructions from memory or from an external source, decoding the instructions, and issuing micro-operation control signals to corresponding components. Specifically, the control module 210 may be configured to control the operation module 220 to perform corresponding processing according to the received instruction.
  • the instructions may include, but are not limited to, data access instructions, operation instructions, descriptor management instructions, and synchronization instructions. The present disclosure does not limit the specific type of instruction and the specific manner of decoding.
  • the decoded instruction includes the opcode and operands.
  • at least one operand of the instruction may include at least one descriptor indicating at least one of the following: shape information of the tensor data and spatial information of the tensor data.
  • the arithmetic module 220 is configured to execute specific instructions or operations under the control of the control module 210 .
  • the operation module 220 may include, for example, but not limited to, an arithmetic logic unit (arithmetic and logic unit, ALU), a memory access unit (memory access unit, MAU), an artificial intelligence operation unit (neural functional unit, NFU), and the like.
  • the present disclosure does not limit the specific hardware type of the execution unit.
  • the storage module 230 may be configured to store various information including, but not limited to, instructions, information associated with descriptors, tensor data, and the like.
  • the storage module 230 may include various storage resources including, but not limited to, internal memory and external memory.
  • Internal memory may include, for example, registers, on-chip SRAM, or other medium caches.
  • External memory may include, for example, off-chip memory. The present disclosure does not limit the specific implementation of the memory module.
  • the processing apparatus 200 may further include a tensor interface unit (Tensor interface Unit, TIU) 240 .
  • the tensor interface unit 240 may be configured to implement operations associated with the descriptor under the control of the control module 210 . These operations may include, but are not limited to, registration, modification, cancellation, and parsing of descriptors; reading and writing of content of descriptors.
  • the present disclosure does not limit the specific hardware type of the tensor interface unit. In this way, operations associated with descriptors can be implemented through dedicated hardware, which further improves the access efficiency of tensor data.
  • tensor interface unit 240 may be configured to parse descriptors included in operands of instructions. For example, the tensor interface unit may parse the shape information of the tensor data included in the descriptor to determine the data address in the data storage space of the data corresponding to the operand.
  • Although the control module 210 and the tensor interface unit 240 are shown as two separate modules in FIG. 2, those skilled in the art can understand that these two modules/units can also be implemented as one module or as more modules, and the present disclosure is not limited in this regard.
  • the data processing apparatus 200 can be implemented by a general-purpose processor (such as a central processing unit CPU, a graphics processing unit GPU) and/or a special-purpose processor (such as an artificial intelligence processor, a scientific computing processor, or a digital signal processor, etc.). There is no restriction on the specific type of data processing device.
  • When the hardware executes instructions in parallel, if there is a dependency between the instructions executed in parallel, it may cause an error in the execution result. For example, if two instructions executed in parallel access the same storage unit, and at least one of the two instructions writes to the storage unit, there is a dependency between the two instructions, such as a read-after-write dependency, a write-after-read dependency, or a write-after-write dependency. In this case, if the latter instruction is executed before the former instruction, an execution error will result. Therefore, the order consistency of the execution of these instructions must be guaranteed, for example, by forced sequential execution, that is, the subsequent instruction must wait for the completion of the previous instruction before it can execute.
  • tensor data is usually a multi-dimensional array with a large amount of data, so the instruction processing time for tensor data is usually longer than that for scalar data. At this time, if tensor data is still processed according to the forced sequential execution method described above, the processing time is too long and the efficiency is low.
  • an operation-level instruction parallelism scheme is provided, in which the parallelism of operations is restricted based on fine-grained regions of the shape coordinate space of the tensor data targeted by the operations, so that the parallel execution potential of the operations can be exploited. Therefore, according to the embodiments of the present disclosure, the consistency of the execution order can be ensured during parallel execution by the hardware, and the degree of parallelism of operations can be improved, thereby ensuring the accuracy and efficiency of processing.
  • FIG. 3A shows an exemplary flowchart of a processing method 300 according to an embodiment of the present disclosure.
  • the processing method 300 can be implemented, for example, by the processing apparatus 200 of FIG. 2 .
  • the method 300 starts at step S301, obtaining a first operation of an instruction.
  • This step may be performed, for example, by the control module 210 of FIG. 2 .
  • the first operation is an operation on tensor data whose shape coordinate space includes at least one fine-grained region.
  • the fine-grained region may include one or more adjacent coordinate points in the shape coordinate space of the tensor data.
  • a fine-grained region is the smallest unit of operation.
  • the operations involved in the present disclosure may be basic operations supported by the processor hardware, or may be microinstructions (eg, request signals, etc.) after parsing the basic operations.
  • the present disclosure does not limit the specific type of operation.
  • the processing apparatus of the present disclosure may execute two operations in parallel, or may execute more than two operations in parallel, and the disclosure does not limit the number of operations executed in parallel.
  • the two operations performed in parallel may belong to the same instruction or may belong to different instructions, and the present disclosure is not limited in this respect.
  • the processor can execute multiple operations in parallel.
  • However, when the operations target the same tensor data, the processor will execute only one of the multiple operations while blocking the other operations, thereby reducing the efficiency of the processor.
  • the shape coordinate space of the processed tensor data is further divided into a plurality of fine-grained regions, and whether the operation can be executed in parallel is determined based on the fine-grained regions, thereby greatly improving the efficiency of the processor.
  • the shape, size, and/or number of fine-grained regions may be determined based, at least in part, on at least one of: the computing power of the hardware; the bandwidth of the hardware; and the size of the shape coordinate space of the tensor data.
  • the hardware computing capability may be the amount of data that the hardware processes in parallel in one computing cycle, and the hardware bandwidth may be the data transmission capability, such as the amount of data transmitted per unit time.
  • the processor to which the processing method of the embodiment of the present disclosure is applied has the hardware computing capability of processing 100-bit data in parallel in one computing cycle, and the hardware bandwidth of transmitting 200-bit data per unit time.
  • the shape coordinate space of the tensor data can be divided into 100 fine-grained regions according to the hardware computing capability, wherein each fine-grained region includes 100-bit data; the shape coordinate space can also be divided according to the hardware bandwidth is 50 fine-grained regions, wherein each fine-grained region includes 200-bit data.
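  • A minimal sketch of such a division (assuming, purely for illustration, that the region size is matched directly to the amount of data the hardware can process or transfer at once, and abstracting the tensor as a total data volume in bits, as implied by the example above):

```python
def plan_fine_grained_regions(total_bits, region_bits):
    # Divide tensor data into fine-grained regions whose size matches the
    # hardware computing capability or the hardware bandwidth.
    num_regions = -(-total_bits // region_bits)  # ceiling division
    return num_regions, region_bits

# 10,000-bit tensor data, as in the example above:
#   by computing capability (100 bits per cycle) -> 100 regions of 100 bits
#   by bandwidth (200 bits per unit time)        -> 50 regions of 200 bits
assert plan_fine_grained_regions(10_000, 100) == (100, 100)
assert plan_fine_grained_regions(10_000, 200) == (50, 200)
```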
  • the hardware computing capability and hardware bandwidth may vary according to different processor hardware, and the present disclosure does not limit the hardware computing capability and hardware bandwidth.
  • In this way, the size and/or number of the fine-grained regions can be determined according to the processing capability (hardware computing capability and/or hardware bandwidth) of the processor, so that the division of fine-grained regions better matches the requirements of different hardware usage environments, the operations performed on the fine-grained regions tend to keep pace with the processing capability of the processor, and the execution efficiency of the hardware can be exploited as much as possible, thereby improving the processing efficiency of the processor.
  • the shapes and sizes of the multiple fine-grained regions may be the same or different.
  • the first operation may also carry a first fine-grained quantity (for example, set to 4)
  • the second operation may carry a second fine-grained quantity (for example, set to 8). That is, when the first operation is performed, the shape coordinate space is divided into 4 fine-grained regions, and when the second operation is performed, the shape coordinate space is divided into 8 fine-grained regions.
  • the operation can also carry fine-grained parameters such as shape, size, and quantity at the same time.
  • the shape, size and/or number of each fine-grained region can be determined according to requirements, which are not limited in the present disclosure.
  • In step S302, it is determined whether there is an ongoing second operation on the tensor data.
  • As mentioned above, when an instruction involves the processing of tensor data, the operand includes a descriptor through which information related to the tensor data can be obtained. Therefore, in some embodiments, the descriptor may include spatial information of the tensor data (e.g., a spatial identification ID), and the dependency between instructions may be determined according to the spatial information of the tensor data. Since different spatial IDs indicate that there is no dependency between the spatial regions they point to, it can be quickly judged whether there is a dependency between two instructions according to whether the spatial IDs of the tensor data processed by the two instructions are the same, that is, whether they operate on the same tensor data.
  • whether there is an ongoing second operation on the tensor data may be determined according to the occupancy state of the data storage area corresponding to the tensor data. For example, the processor may determine whether the data storage area of the tensor data is occupied by querying the occupancy status list, and if it is occupied, the judgment result is that there is an ongoing second operation on the tensor data.
  • the occupancy state list may be preset and stored on the memory, or may be generated before the processor starts to execute a certain task, and is logged out after the task is completed. When the occupation status of each data storage area changes, the processor updates the content of the occupation status list to record the occupation status of each data storage area. The present disclosure does not limit the way of judging whether there is an ongoing second operation on one or more tensor data.
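  • One way to realize such an occupancy status list is a simple mapping from data storage areas to their occupancy state, queried before an operation is dispatched (a sketch; keying the areas by the tensor's space ID and the class and method names are assumptions of this illustration):

```python
class OccupancyList:
    # Minimal occupancy status list: records which data storage areas
    # (keyed here by space ID) are occupied by an ongoing operation.
    def __init__(self):
        self._occupied = {}

    def acquire(self, space_id, op_id):
        self._occupied[space_id] = op_id      # mark the area as occupied

    def release(self, space_id):
        self._occupied.pop(space_id, None)    # mark the area as available

    def ongoing_op(self, space_id):
        # Returns the identifier of the ongoing second operation, or None
        # if the data storage area of the tensor data is not occupied.
        return self._occupied.get(space_id)

occupancy = OccupancyList()
occupancy.acquire(space_id=7, op_id="write_A")
assert occupancy.ongoing_op(7) == "write_A"   # an ongoing second operation exists
assert occupancy.ongoing_op(8) is None        # no ongoing operation on this area
```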
  • In step S303, when such a second operation exists, it is determined whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation.
  • the first fine-grained region and the second fine-grained region may be any fine-grained regions among multiple fine-grained regions in the shape coordinate space of the tensor data.
  • an operation on tensor data is an operation on each fine-grained region in the shape coordinate space of the tensor data.
  • tensor data A is a two-dimensional matrix with 8 rows and 16 columns
  • its shape coordinate space is a two-dimensional space
  • every 2 rows and 4 columns is a fine-grained area
  • the shape coordinate space of the tensor data thus includes 16 fine-grained regions.
  • a write operation for tensor data A can be regarded as a write operation for the 16 fine-grained regions.
  • the execution process can be: write the first fine-grained region (rows 1-2, columns 1-4); after the first fine-grained region is written, write the second fine-grained region (rows 1-2, columns 5-8); after the second fine-grained region is written, write the third fine-grained region (rows 1-2, columns 9-12); and so on until the 16th fine-grained region (rows 7-8, columns 13-16) is written, completing the write operation on tensor data A.
  • operations can also be performed on multiple fine-grained regions at a time, for example, two or more fine-grained regions are written at a time until the operations on all regions are completed.
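  • The example above can be sketched as follows: an 8x16 tensor A divided into 2x4 fine-grained regions (16 in total) is written region by region in the stated order (the container and the region numbering below are choices of this sketch):

```python
ROWS, COLS = 8, 16          # shape of tensor data A
GRAIN_R, GRAIN_C = 2, 4     # each fine-grained region is 2 rows x 4 columns

A = [[0] * COLS for _ in range(ROWS)]

def region_bounds(region_index):
    # Regions are numbered row by row: region 0 covers rows 0-1, columns 0-3,
    # region 1 covers rows 0-1, columns 4-7, ..., region 15 covers rows 6-7,
    # columns 12-15.
    regions_per_row = COLS // GRAIN_C
    r = (region_index // regions_per_row) * GRAIN_R
    c = (region_index % regions_per_row) * GRAIN_C
    return r, c

# A write operation on A performed as 16 successive fine-grained writes.
for region in range(ROWS * COLS // (GRAIN_R * GRAIN_C)):
    r0, c0 = region_bounds(region)
    for r in range(r0, r0 + GRAIN_R):
        for c in range(c0, c0 + GRAIN_C):
            A[r][c] = region + 1   # the written payload is illustrative
```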
  • the state of a fine-grained region in the shape coordinate space of the tensor data may include an operation-completed state, a being-operated state, and a not-yet-operated state; or, in the case where it is not necessary to record whether the operation has been completed, the state may include an occupied state and an available state.
  • the state of the fine-grained region currently targeted by an operation is the being-operated state or the occupied state. Therefore, when there is an operation on tensor data, it can be considered that there is an operation on some fine-grained region in the shape coordinate space of the tensor data, and the fine-grained region that is being operated on or is occupied is the fine-grained region currently targeted by that operation.
  • the first fine-grained region currently targeted by the first operation may include the fine-grained region targeted by a first operation that is yet to be executed; for example, when the operation starts and is specified to execute in a predetermined order, this is usually the first fine-grained region. It may also include the fine-grained region currently targeted by a first operation that is being executed, which may be any fine-grained region.
  • the second fine-grained area currently targeted by the second operation may be the fine-grained area currently targeted by the second operation being executed, and may be any fine-grained area.
  • When the first operation has not yet operated on any region, the first fine-grained region currently targeted by the first operation is the fine-grained region on which the first operation is about to be performed, which is usually the first fine-grained region in the shape coordinate space of the tensor data.
  • the second fine-grained area currently targeted by the ongoing second operation may be related to the execution process of the second operation.
  • For example, the second fine-grained region may also be the first fine-grained region in the shape coordinate space of the tensor data; at this time, the first fine-grained region overlaps with the second fine-grained region. If the second operation has completed its operation on the first fine-grained region and the second fine-grained region currently targeted is the P-th fine-grained region (P is an integer greater than 1), the first fine-grained region and the second fine-grained region do not overlap.
  • The timing for judging, during the operation of the first operation on the tensor data, whether there is an ongoing second operation on the tensor data may be determined according to the execution process of the first operation. For example, if the rhythm of the execution process of each operation is consistent, it is only necessary, before the first operation is performed on the tensor data, to judge whether there is an ongoing second operation on the tensor data and then judge whether the first fine-grained region overlaps with the second fine-grained region.
  • Rhythm consistency means that, when the sizes of the fine-grained regions are the same, the durations of the two operations on a fine-grained region are the same.
  • Alternatively, each time the operation on the currently targeted first fine-grained region is completed, it can be judged again whether there is an ongoing second operation on the tensor data and whether the first fine-grained region overlaps with the second fine-grained region, so as to determine whether the first operation can be continued.
  • whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation may be determined according to coordinate addresses, pointer positions, fine-grained region identifiers, and the like.
  • For example, the coordinate address currently operated on by each operation can be recorded; according to the current coordinate address of the first operation, the current coordinate address of the second operation, and the correspondence between coordinate addresses and fine-grained regions, the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation can be determined, and it is then determined whether the first fine-grained region and the second fine-grained region overlap.
  • Both the coordinate address and the fine-grained region are defined based on the shape coordinate space of the tensor data. Therefore, after knowing the fine-grained division of the shape coordinate space, the corresponding fine-grained region can be directly determined from the coordinate address.
  • Alternatively, a pointer may be set for each operation, and the pointer points to the fine-grained region currently targeted by that operation. According to the pointer position of the first operation and the pointer position of the second operation, the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation can be determined respectively, and it is then determined whether the first fine-grained region and the second fine-grained region overlap.
  • An identifier may also be set for each fine-grained region, and whether the first fine-grained region and the second fine-grained region overlap is determined by recording the identifier of the fine-grained region currently targeted by each operation. The identifier can include any combination of letters, numbers, or symbols. Whether the first fine-grained region and the second fine-grained region overlap may also be judged in other ways, and the present disclosure does not limit the basis for this judgment.
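  • A hedged sketch of one such basis: each operation records the index of the fine-grained region that contains its current coordinate address, and the overlap check simply compares the two indices (equal-sized regions and the helper names are assumptions of this illustration):

```python
def region_id(coord, grain, regions_per_row):
    # Maps a 2-D coordinate (x, y) in the shape coordinate space to the
    # identifier of the fine-grained region containing it.
    x, y = coord
    grain_x, grain_y = grain
    return (y // grain_y) * regions_per_row + (x // grain_x)

def regions_overlap(first_coord, second_coord, grain, regions_per_row):
    # The first and second fine-grained regions overlap exactly when the two
    # currently targeted coordinates fall into the same region.
    return (region_id(first_coord, grain, regions_per_row)
            == region_id(second_coord, grain, regions_per_row))

# 16-column tensor divided into 2x4 regions, i.e. 4 regions per row of regions:
assert regions_overlap((0, 0), (3, 1), grain=(4, 2), regions_per_row=4)
assert not regions_overlap((0, 0), (4, 0), grain=(4, 2), regions_per_row=4)
```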
  • In step S304, when there is no overlap between the first fine-grained region and the second fine-grained region, the first operation is performed.
  • When the first fine-grained region currently targeted by the first operation does not overlap with the second fine-grained region currently targeted by the second operation, the first fine-grained region may be a fine-grained region on which the second operation has already completed its operation, or a fine-grained region on which the second operation does not need to operate. In this case, executing the first operation will not affect the operation process or operation result of the second operation, and the first operation can be performed.
  • In this way, when the shape coordinate space of the tensor data targeted by the first operation includes at least one fine-grained region and there is an ongoing second operation on the tensor data, the first operation can be executed as long as the fine-grained regions currently targeted by the first operation and the second operation do not overlap, so that the first operation and the second operation can operate on the same tensor data at the same time, which improves the processing efficiency of the processor.
  • the processing method 300 may further include: blocking the first operation when the first fine-grained region overlaps with the second fine-grained region.
  • the first fine-grained region and the second fine-grained region overlap, including the first fine-grained region and the second fine-grained region completely or partially overlapping.
  • the operation of the first operation on the overlapping partial area may affect the execution of the second operation, resulting in an inaccurate operation result of the second operation.
  • Conversely, the operation of the second operation on the overlapping partial area may also affect the execution of the first operation, resulting in an inaccurate operation result of the first operation.
  • the first operation may be blocked, that is, the execution of the first operation may be suspended, and the first operation may be executed after the second operation completes the operation on the second fine-grained region currently targeted. That is, when the first fine-grained area does not overlap with the second fine-grained area, the first operation is performed.
  • blocking the first operation can avoid operation errors and inaccurate operation results caused by the overlapping of the fine-grained regions targeted by the operations, ensuring the correctness of each operation.
  • At least one of the first operation and the second operation may be a write operation. That is, when the operations on the target data form a read-after-write relationship (the second operation is a write operation and the first operation is a read operation), a write-after-read relationship (the second operation is a read operation and the first operation is a write operation), or a write-after-write relationship (both the second operation and the first operation are write operations), there will be a dependency between the two operations, in which case the method in the embodiments of the present disclosure can be used.
  • In this way, operations with read-after-write, write-after-read, and write-after-write dependencies can be executed correctly to obtain accurate execution results, the waiting time between operations can be reduced, and the execution efficiency of the processor can be improved.
  • a processing method for determining the execution range of an operation based on the coordinate space range expressed by the fine-grained region is also provided.
  • FIG. 3B schematically shows an exemplary flowchart of a processing method according to an embodiment of the present disclosure.
  • the processing method of FIG. 3B can be implemented by, for example, the processing apparatus 200 of FIG. 2 .
  • In step S311, a first coordinate space range of the tensor data that the first operation is allowed to use is determined. This step may be performed, for example, by the control module 210 of FIG. 2.
  • the first coordinate space range may be, for example, a portion of the shape coordinate space of the tensor data involved in the first operation.
  • In step S312, the second coordinate space range of the tensor data to be used when the first operation is performed is determined. This step may be performed, for example, by the operation module 220 of FIG. 2.
  • the second coordinate space range may be, for example, a portion of the shape coordinate space of the tensor data involved in the first operation.
  • In step S313, the first operation is performed within the third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. This step may be performed, for example, by the operation module 220 of FIG. 2.
  • the first coordinate space range, the second coordinate space range, and the third coordinate space range can all be expressed based on fine-grained regions in the shape coordinate space of the tensor data, that is, the first, second, and third coordinate space ranges are characterized using fine-grained regions.
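  • When each coordinate space range is expressed as a set of fine-grained region indices, the third coordinate space range is simply the intersection of the other two (a sketch; the set representation is an assumption of this illustration):

```python
def third_range(first_range, second_range):
    # first_range:  fine-grained regions the first operation is allowed to use
    # second_range: fine-grained regions the first operation would use when executed
    # The first operation actually proceeds on the intersection of the two.
    return set(first_range) & set(second_range)

allowed = {0, 1, 2, 3, 4, 5}   # first coordinate space range
needed = {4, 5, 6, 7}          # second coordinate space range
assert third_range(allowed, needed) == {4, 5}   # third coordinate space range
```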
  • the overlap determination of the fine-grained regions described above in conjunction with FIG. 3A and FIG. 3B may be performed only under certain conditions, thereby shortening the determination time and speeding up instruction processing.
  • FIG. 3C schematically shows an exemplary flowchart of a processing method according to another embodiment of the present disclosure.
  • the first operation of the instruction is obtained.
  • the first operation is an operation on tensor data, and its operand may include a descriptor of the tensor data.
  • In step S322, it is determined whether there is an ongoing second operation on the tensor data. This step is similar to step S302 described above in conjunction with FIG. 3A, and will not be repeated here.
  • If there is no such second operation, the method may jump to step S326, that is, the first operation is executed. This means that there is no second operation that could conflict with the first operation, so the first operation can be executed immediately.
  • If other operations are in progress, the first operation is performed in parallel with these other operations at this time.
  • Otherwise, in step S323, it is further determined whether the data operation ranges of the first operation and the second operation overlap. It can be understood that, since tensor data usually has large dimensions, the data operation ranges of different operations may differ. When the data operation ranges of different operations do not overlap with each other, the first operation can be performed in parallel with the preceding second operation without conflict.
  • whether the data operation ranges of the operations overlap may be determined based on the spatial information and/or shape information of the tensor data to be operated on.
  • the shape information of the tensor data can be used to determine the access address of the operation, so as to determine whether there is overlap between the data operation ranges of the two operations.
  • the access address may be a coordinate space address of tensor data or a storage space address of tensor data, which is not limited in this aspect of the present disclosure.
  • If it is determined in step S323 that the data operation ranges of the first operation and the second operation do not overlap, the method may jump to step S326, where the first operation is performed. This means that even if the first operation and the second operation access the same tensor data (as determined in step S322), as long as their data operation ranges do not overlap, that is, they each access non-overlapping parts of the same tensor data, the first operation can be performed in parallel with the second operation.
  • If it is determined in step S323 that the data operation ranges of the first operation and the second operation overlap, the method may proceed to step S324, where it is further determined whether the fine-grained regions currently targeted by the first operation and the second operation overlap.
  • For the specific determination method of step S324, reference may be made to the foregoing description in conjunction with FIG. 3A and FIG. 3B.
  • When it is determined in step S324 that the fine-grained regions currently targeted by the first operation and the second operation do not overlap, the method may proceed to step S326, that is, the first operation is performed. Therefore, based on the dynamic execution of the operations, it is possible to dynamically determine whether the currently targeted fine-grained regions overlap, so as to realize parallel execution of operations at the level of fine-grained regions and maximize the parallel potential of the operations.
  • If it is determined in step S324 that the fine-grained regions currently targeted by the two operations overlap, the first operation cannot be performed at this time, otherwise a conflict will be caused. Therefore, in step S325, the first operation is blocked.
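  • The branching logic of FIG. 3C can be summarized in the following sketch (the dictionary representation of an operation and all names here are placeholders for the checks described in steps S322-S324, not an API of the disclosure):

```python
def dispatch_first_operation(first_op, second_op):
    # Decide whether the first operation may execute now, following FIG. 3C.
    # Each operation is modeled as a dict with the space ID of its tensor data,
    # the set of fine-grained region indices it operates on ("range"), and the
    # index of the region it currently targets ("current").

    # Step S322: is there an ongoing second operation on the same tensor data?
    if second_op is None or first_op["space_id"] != second_op["space_id"]:
        return "execute"                                   # step S326

    # Step S323: do the data operation ranges overlap at all?
    if not (first_op["range"] & second_op["range"]):
        return "execute"                                   # step S326

    # Step S324: do the currently targeted fine-grained regions overlap?
    if first_op["current"] != second_op["current"]:
        return "execute"                                   # step S326

    return "block"                                         # step S325

write_op = {"space_id": 7, "range": {0, 1, 2, 3}, "current": 2}
read_op = {"space_id": 7, "range": {0, 1, 2, 3}, "current": 2}
assert dispatch_first_operation(read_op, write_op) == "block"
read_op["current"] = 0   # the first operation now targets a different region than the write
assert dispatch_first_operation(read_op, write_op) == "execute"
```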
  • FIG. 3D shows a schematic functional block diagram of a processing apparatus according to an embodiment of the present disclosure.
  • the processing apparatus 30 includes an operation acquisition unit 31 , a first determination unit 32 , a second determination unit 33 and an execution unit 34 .
  • the operation acquisition unit 31 is configured to acquire the first operation of the instruction.
  • the first operation is an operation on tensor data.
  • the shape coordinate space of the tensor data includes at least one fine-grained region, and each fine-grained region includes one or more adjacent coordinate points in the shape coordinate space.
  • the first determination unit 32 is configured to determine whether there is an ongoing second operation on the tensor data.
  • the second determining unit 33 is configured to, when there is such a second operation, determine whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation.
  • the executing unit 34 is configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.
  • the second determination unit 33 may include a first determination subunit 331 and a second determination subunit 332 .
  • the first determination subunit 331 is configured to determine a first coordinate space range of the tensor data that is allowed to be used by the first operation.
  • the second determination subunit 332 is configured to determine the second coordinate space range of the tensor data to be used when the first operation is performed.
  • the execution unit 34 may be configured to execute the first operation within the third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range.
  • the first coordinate space extent, the second coordinate space extent, and the third coordinate space extent are characterized using fine-grained regions in the shape coordinate space of the tensor data.
  • the processing device 30 may further include a blocking unit 35 and a third determining unit 36 .
  • The blocking unit 35 may be configured to block the first operation, so as to prevent a conflict, when it is determined that the fine-grained region currently targeted by the first operation overlaps with the fine-grained region currently targeted by the second operation.
  • the third determination unit 36 may be configured to perform a static judgment in advance, that is, to determine whether the data operation ranges of the first operation and the second operation overlap. The judgment of the second determination unit 33 is performed only when the data operation ranges overlap.
  • the execution unit 34 may execute the first operation according to the judgment result of each determination unit.
  • The units shown in FIG. 3D are divided according to the functions they implement. This division is only exemplary; in an actual implementation, two or more functions may be implemented in the same hardware unit, and one function may also be implemented distributed across two or more hardware units.
  • The operation acquisition unit 31, the first determination unit 32 and the optional third determination unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2, while the second determination unit 33 and the execution unit 34 may be included in the operation module 220 of the processing apparatus 200.
  • the operation acquisition unit 31 , the first determination unit 32 , the second determination unit 33 and the optional third determination unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2
  • the execution unit 34 is included in the operation module 220 of the processing device 200 .
  • FIG. 4 schematically illustrates the division of coordinate space ranges according to an embodiment of the present disclosure.
  • FIG. 4 uses two-dimensional data as an example for illustrative illustration, but those skilled in the art can understand that the same solution can be similarly applied to three-dimensional or even more dimensional tensor data.
  • the shape coordinate space 400 of the two-dimensional tensor data is divided into 12 fine-grained regions, 4001, 4002, ..., 4011 and 4012, respectively. Within each fine-grained region, accesses to it are guaranteed to be sequential.
  • Any data element (eg, data point) on the tensor data can be represented by two-dimensional spatial coordinates (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward). Obviously, the coordinates of any data element on the tensor data will not exceed the maximum size of the shape's coordinate space.
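  • As an illustrative sketch, a coordinate point (x, y) of the two-dimensional tensor can be mapped to one of the fine-grained regions 4001-4012 of FIG. 4; the assumption that the regions form a 3-row by 4-column grid of equal sizes is made here only for illustration, since the exact region sizes are not specified.

      # Sketch: map a coordinate (x, y) to a fine-grained region identifier, assuming a grid of
      # 3 rows x 4 columns of equally sized regions (REGION_W and REGION_H are assumed values).
      REGION_W, REGION_H = 8, 4
      COLS = 4

      def fine_grained_region(x, y):
          col = x // REGION_W
          row = y // REGION_H
          return 4001 + row * COLS + col          # identifiers 4001..4012

      assert fine_grained_region(0, 0) == 4001
      assert fine_grained_region(9, 5) == 4006    # second row, second column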
  • all fine-grained regions in the shape coordinate space of the tensor data that are not currently used by prior operations associated with the first operation may be determined as the first coordinate space range.
  • For example, the spatial range (i.e., the first coordinate space range) may include fine-grained regions 4001-4003 as well as fine-grained regions 4005-4007, as shown with diagonal shading.
  • The fine-grained regions determined based on the coordinates of the tensor data to be accessed by the first operation are determined as the second coordinate space range.
  • For example, the second coordinate space range at this time may be determined as fine-grained regions 4003-4012, as indicated by dot filling.
  • the operable range when the first operation is actually performed is the intersection of the first coordinate space range and the second coordinate space range.
  • The third coordinate space range is the area where both the diagonal shading and the dot filling exist, that is, fine-grained regions 4003 and 4005-4007 in FIG. 4.
  • the first coordinate space range, the second coordinate space range, and the third coordinate space range can be directly characterized by the identification of the fine-grained region included in each of them.
  • the first coordinate space range can be characterized by the identification of fine-grained regions 4001-4003 and fine-grained regions 4005-4007;
  • the third coordinate space range can be characterized using the identification of fine-grained regions 4003 and 4005-4007.
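  • As an illustrative sketch, when the ranges are characterized directly by the identifiers of the fine-grained regions they contain, the third coordinate space range is simply a set intersection:

      # Sketch: the three coordinate space ranges of the FIG. 4 example as sets of fine-grained
      # region identifiers; the operable (third) range is the intersection of the first two.
      first_range  = {4001, 4002, 4003, 4005, 4006, 4007}   # allowed to be used by the first operation
      second_range = set(range(4003, 4013))                 # expected to be used: regions 4003-4012
      third_range  = first_range & second_range
      assert third_range == {4003, 4005, 4006, 4007}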
  • Access to tensor data is usually performed along a certain dimension: the access coordinates are gradually increased, and the data units at each coordinate point of the tensor data are traversed from front to back.
  • In this case, the first coordinate space range may be characterized using the fine-grained regions in which the coordinate upper bounds, on one or more dimensions of the tensor data, that the first operation is allowed to use are located; and/or the second coordinate space range may be characterized using the fine-grained regions in which the coordinate lower bounds, on one or more dimensions of the tensor data, that the first operation is expected to use are located.
  • Specifically, the first coordinate space range may be characterized using the fine-grained regions where the coordinate upper bounds that the first operation is allowed to use on one or more dimensions of the tensor data are located.
  • the space range allowed to be used by the first operation (that is, the first coordinate space range) at this time may include the upper left two rows and three columns , a total of 6 fine-grained regions, as indicated by the diagonally shaded regions.
  • the first coordinate space range can be characterized by the X upper bound 411 on the X axis and the fine-grained region where the Y upper bound 421 on the Y axis is located.
  • For example, the first coordinate space range can be characterized using fine-grained regions 4003 and 4005, which indicate that the data coordinates accessed by the first operation cannot exceed fine-grained region 4003 in the X dimension and cannot exceed fine-grained region 4005 in the Y dimension.
  • Similarly, the second coordinate space range may be characterized using the fine-grained regions where the coordinate lower bounds, on one or more dimensions of the tensor data, that the first operation is expected to use are located. For example, when it is determined, based on the coordinates of the tensor data to be accessed by the first operation, that the first operation will use the fine-grained regions other than the two left-most fine-grained regions of the first row, it can be determined that the second coordinate space range includes the remaining 10 fine-grained regions, as shown by the dot-filled portion.
  • At this time, in FIG. 4, the second coordinate space range can be characterized by the fine-grained regions in which the X lower bound 412 on the X axis and the Y lower bound 422 on the Y axis are located.
  • That is, the second coordinate space range can be characterized by fine-grained regions 4002 and 4001, which indicate that, when the first operation is performed, the data in the tensor data that is lower than fine-grained region 4002 in the X dimension and lower than fine-grained region 4001 in the Y dimension will not be used.
  • the operable range when the first operation is actually performed is the third coordinate space range, which is the intersection of the first coordinate space range and the second coordinate space range.
  • The third coordinate space range is the area where both the diagonal shading and the dot filling exist, that is, the "inverse L-shaped" area in FIG. 4.
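  • As an illustrative sketch of characterizing a range by per-dimension bounds, the first coordinate space range of FIG. 4 can be rebuilt from the fine-grained regions in which the X and Y coordinate upper bounds fall; the 3-row by 4-column grid layout is an assumption made only for illustration.

      # Sketch: derive the first coordinate space range from upper-bound regions (4003 on the
      # X axis, 4005 on the Y axis in the FIG. 4 example), assuming a 3x4 grid of regions.
      COLS = 4

      def region_col_row(region_id):
          idx = region_id - 4001
          return idx % COLS, idx // COLS

      def first_range_from_upper_bounds(x_bound_region, y_bound_region):
          x_max, _ = region_col_row(x_bound_region)      # column that must not be exceeded
          _, y_max = region_col_row(y_bound_region)      # row that must not be exceeded
          return {4001 + r * COLS + c
                  for r in range(y_max + 1) for c in range(x_max + 1)}

      # Reproduces the six diagonally shaded regions of FIG. 4:
      assert first_range_from_upper_bounds(4003, 4005) == {4001, 4002, 4003, 4005, 4006, 4007}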
  • the first coordinate space extent and the second coordinate space extent can be determined in various ways.
  • At least one of the following may be additionally considered to determine the first coordinate space range: the sequence of operations; the operands involved in the operations; the second coordinate space range of the previous operation.
  • For example, the coordinate space lower bound of the tensor data used by the previous operation or instruction can be used as the coordinate space upper bound of the tensor data allowed to be used by the current new instruction.
  • In some embodiments, the coordinate space upper bound is the coordinate space lower bound of the nearest (i.e., the most recent) write operation on the tensor data.
  • In other embodiments, the coordinate space upper bound is the minimum value among the coordinate space lower bound of the most recent write operation on the tensor data and the coordinate space lower bounds of all read operations on the tensor data between the two write operations. By selecting the minimum value, it can be ensured that the execution of the first operation will not affect the execution of any preceding operation.
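  • As an illustrative sketch (the record format below is hypothetical), the coordinate space upper bound allowed for a new operation can be derived from the lower bounds already published by the preceding operations on the same tensor data:

      # Sketch: walk back from the newest preceding operation to the nearest write, collecting
      # coordinate space lower bounds, and take the minimum as the new operation's upper bound.
      def upper_bound_for_new_op(preceding_ops):
          bounds = []
          for op in reversed(preceding_ops):          # newest first
              bounds.append(op["lower_bound"])
              if op["kind"] == "write":               # stop at the most recent write
                  break
          return min(bounds) if bounds else None

      ops = [{"kind": "write", "lower_bound": 5},
             {"kind": "read",  "lower_bound": 8},
             {"kind": "read",  "lower_bound": 6}]
      assert upper_bound_for_new_op(ops) == 5         # min over the last write and later reads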
  • the second coordinate space range may be determined based on at least one of: an execution range of the operation; an access mode of the operation; and a current execution state of the operation.
  • The above factors can be comprehensively considered to determine the second coordinate space range, so as to ensure that, when the tensor data is accessed along a dimension, the coordinates on the corresponding dimension are not less than the coordinate space lower bound. Further, the coordinate space lower bound should be made as large as possible, so that the space range accessible to subsequent operations or instructions is larger.
  • the coordinate space lower bound may be determined based on the minimum access coordinate of the first operation.
  • the coordinate space lower bound may be determined as the fine-grained region in which the smallest access coordinate is located.
  • For example, when the first operation accesses data along the X dimension, assuming that the minimum X coordinate of the accessed data falls in the second fine-grained region from the left, the X lower bound can be determined as that second fine-grained region.
  • Similarly, when the first operation accesses data along the Y dimension, assuming that the minimum Y coordinate of the accessed data is B and falls in the third fine-grained region from the top, the Y lower bound can be determined as that third fine-grained region.
  • In some embodiments, the coordinate space lower bound may be determined based on the access regularity of the operation. For example, a convolution operation may need to access the tensor data in blocks, so the coordinate space lower bound can be determined according to the blocking pattern of the convolution operation.
  • the coordinate space lower bound may be determined based on a predetermined setting.
  • the coordinate space lower bound may be a default value such as 0 or the size of 1 or more fine-grained regions.
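  • As an illustrative sketch, the coordinate space lower bound on one dimension can be taken as the start of the fine-grained region containing the smallest coordinate accessed so far, falling back to a default of 0 when no better information is available (the region size used here is an assumed value):

      # Sketch: lower bound = start of the fine-grained region holding the minimum accessed
      # coordinate; default to 0 when the minimum access coordinate is not (yet) known.
      REGION_SIZE_X = 8

      def x_lower_bound(min_x_accessed=None):
          if min_x_accessed is None:
              return 0                                              # predetermined default
          return (min_x_accessed // REGION_SIZE_X) * REGION_SIZE_X  # start of that region

      assert x_lower_bound() == 0
      assert x_lower_bound(13) == 8    # coordinate 13 falls in the second region [8, 16)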
  • the first and second coordinate space extents may be determined based on a pre-division of the shape coordinate space of the tensor data.
  • the shape coordinate space of the tensor data may be firstly divided into several spatial blocks, for example, divided uniformly or non-uniformly in each dimension, and each spatial block includes one or more fine-grained regions.
  • the shape coordinate space of tensor data is divided into 6 spatial blocks, each spatial block A, B, C, D, E and F including 2 fine-grained regions.
  • In these implementations, a space block in the shape coordinate space of the tensor data on which the previous operation has already been completed may be determined as the first coordinate space range; and a space block determined based on the coordinates of the tensor data to be accessed by the first operation may be determined as the second coordinate space range.
  • For example, the space range that the first operation is allowed to use (that is, the first coordinate space range) may include space block A, space block B and space block C.
  • When it is determined, based on the coordinates of the tensor data to be accessed, that the first operation will use space block A and space block B, the second coordinate space range can be determined as space block A and space block B.
  • the operable range when the first operation is actually performed is the intersection of the first coordinate space range and the second coordinate space range.
  • The third coordinate space range in this example is space blocks A and B.
  • the first operation may be performed based on at least one order of: a predetermined spatial block order; and/or a predetermined fine-grained region order.
  • Specifically, a space block order can be predetermined, that is, the order in which the space blocks in the coordinate space are operated on, for example, the order of space blocks A, B, C, D, E and F.
  • The instructions can then operate on the space blocks one by one in that order. For example, assuming that the preceding instruction 1 writes the tensor data and the following instruction 2 reads the tensor data, instruction 1 can first perform a write operation on space block A and then perform a write operation on space block B.
  • At this time, instruction 2 can start to read space block A. If the division of the space blocks makes the execution rhythm of instruction 2 consistent with that of instruction 1, then at a subsequent time, when instruction 1 starts to write space block C, instruction 2 has also completed the read operation on space block A and starts a read operation on space block B; and so on. It can be seen that the division into space blocks facilitates the parallel execution of instructions, and agreeing on the order of the space blocks helps simplify operation scheduling, shorten processing time, and improve processing efficiency.
  • When the first operation is performed within a single space block, it may also be performed in a predetermined fine-grained region order.
  • Executing in a predetermined fine-grained region order is likewise beneficial for simplifying operation scheduling; the principle is similar to that of executing in a predetermined space block order described above and will not be repeated here.
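  • As an illustrative sketch of the agreed space-block order described above, instruction 1 writes blocks A-F in order and instruction 2 may read a block as soon as instruction 1 has finished writing it, so the two instructions proceed one block apart:

      # Sketch: a pipelined schedule over the predetermined space-block order A..F, in which the
      # reading instruction always trails the writing instruction by exactly one block.
      BLOCKS = ["A", "B", "C", "D", "E", "F"]

      def pipelined_schedule(blocks):
          schedule = []
          for step in range(len(blocks) + 1):
              writing = blocks[step] if step < len(blocks) else None
              reading = blocks[step - 1] if step >= 1 else None
              schedule.append((writing, reading))
          return schedule

      # Step 0: write A; step 1: write B while reading A; step 2: write C while reading B; ...
      assert pipelined_schedule(BLOCKS)[2] == ("C", "B")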
  • the first and second coordinate space extents may also be determined by combining dynamic determination with pre-partitioning of the tensor data shape coordinate space.
  • the shape coordinate space of the tensor data may be firstly divided into several space blocks, for example, divided uniformly or non-uniformly in each dimension.
  • The first and second coordinate space ranges may then be determined dynamically based on the execution of the operation. For a specific determination manner, reference may be made to the foregoing description, which is not repeated here. In these implementations, when the exact position of the second coordinate space range within a certain space block cannot be determined, the range corresponding to that entire space block may be used by default.
  • the pre-division of the shape coordinate space of the tensor data may be performed based on at least one of the following: the processing capability of the hardware; preset parameters; and the size of the shape coordinate space of the tensor data.
  • the processing capability of the hardware may include, but is not limited to, the data bit width that the hardware can process. Based on the data bit width that the hardware can process, the shape coordinate space of the tensor data is divided, which can give full play to the processing capability of the hardware and improve the efficiency of parallel processing.
  • the preset parameters can directly specify the number of space blocks to be divided, the size of each dimension of the space block, and so on.
  • The shape coordinate space of the tensor data may also be divided based on the size of the shape coordinate space itself. For example, when the tensor data is a two-dimensional matrix of M rows by N columns (M and N are positive integers), each row can be divided evenly into m parts and each column can be divided evenly into n parts, thereby obtaining a total of m*n space blocks.
  • Although six equally divided space blocks are shown in FIG. 4, the coordinate space can also be divided into various numbers of space blocks of unequal sizes, and the present disclosure does not limit the specific division manner.
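  • As an illustrative sketch, a uniform pre-division of an M-row by N-column shape coordinate space into m*n space blocks (assuming, for simplicity, that N is divisible by m and M by n) could look as follows:

      # Sketch: split each row (length N) into m parts and each column (length M) into n parts,
      # yielding m*n rectangular space blocks described by their coordinate ranges.
      def pre_divide(M, N, m, n):
          block_cols, block_rows = N // m, M // n
          return [{"row_start": i * block_rows, "row_end": (i + 1) * block_rows,
                   "col_start": j * block_cols, "col_end": (j + 1) * block_cols}
                  for i in range(n) for j in range(m)]

      assert len(pre_divide(M=12, N=12, m=3, n=2)) == 6   # six space blocks, as in FIG. 4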
  • the above describes the scheme of constraining the space range actually used by the operation in order to ensure the sequential consistency of data processing and improve the efficiency of parallel processing when the hardware performs operations in parallel.
  • The current operation (for example, the aforementioned first operation) and the previous operation (for example, the aforementioned second operation) may be operations in different instructions executed in parallel, or they may be different operations executed in parallel within the same instruction; the present disclosure is not limited in this regard.
  • Although the steps in the method flow chart are displayed in sequence according to the arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the method flow chart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment, but may be executed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with the sub-steps or at least some stages of other steps.
  • FIG. 5 is a structural diagram illustrating a combined processing apparatus 500 according to an embodiment of the present disclosure.
  • the combined processing device 500 includes a computing processing device 502 , an interface device 504 , other processing devices 506 and a storage device 508 .
  • the computing and processing apparatus may include one or more computing apparatuses 510, and the computing apparatus may be configured as the processing apparatus 200 shown in FIG. 2 for performing the operations described herein in conjunction with FIG. 4 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • When one or more computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • These processors may include, but are not limited to, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • Considered on its own, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • The other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device related to artificial intelligence, such as neural network operations) and external data and control, performing basic control including, but not limited to, data movement, and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 602 shown in FIG. 6).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 5 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 606 shown in FIG. 6 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • In some application scenarios, other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may also be integrated on the chip.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 6 .
  • FIG. 6 is a schematic structural diagram illustrating a board 600 according to an embodiment of the present disclosure.
  • the board includes a storage device 604 for storing data, which includes one or more storage units 610 .
  • The storage device may be connected to the chip 602 and to the control device 608 described above, for example via a bus, for data transmission.
  • the board also includes an external interface device 606, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 612 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • The control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • To this end, the control device may include a single-chip microcomputer, also known as a micro controller unit (MCU), for regulating the working state of the chip.
  • In some embodiments, the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing devices.
  • The electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, camera modules, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • In some implementations, the hardware information of the cloud device is compatible with the hardware information of the terminal device and/or the edge device, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
  • Clause 1 A processing method, comprising: acquiring a first operation of an instruction, the first operation being an operation on tensor data, wherein the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points in the shape coordinate space; determining whether there is an ongoing second operation on the tensor data; when the second operation exists, determining whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and performing the first operation when the first fine-grained region does not overlap with the second fine-grained region.
  • Clause 2 The method of Clause 1, further comprising: blocking the first operation when the first fine-grained region overlaps with the second fine-grained region.
  • Clause 3 The method of any of clauses 1-2, further comprising: determining whether the data operation range of the first operation overlaps with the data operation range of the second operation; performing the determination of whether the first fine-grained region and the second fine-grained region overlap when the data operation ranges overlap; and performing the first operation when the data operation ranges do not overlap.
  • Clause 4 The method of Clause 3, wherein determining whether the data manipulation scope of the first operation overlaps the data manipulation scope of the second operation is determined based on at least one of the following:
  • spatial information of the tensor data to be operated on; and shape information of the tensor data to be operated on.
  • Clause 5 The method of any of clauses 1-4, wherein performing the first operation comprises: determining a first coordinate space range of the tensor data that the first operation is allowed to use; determining a second coordinate space range of the tensor data to be used when performing the first operation; and performing the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range, wherein the first coordinate space range, the second coordinate space range and the third coordinate space range are characterized by the fine-grained region.
  • Clause 6 The method of Clause 5, wherein the first coordinate space range is determined based on at least one of the following: a sequence of operations; operands involved in the operations; a second coordinate space range of a previous operation; and a predetermined division of the shape coordinate space of the tensor data.
  • Clause 7 The method of any of clauses 5-6, wherein the second coordinate space range is determined based on at least one of the following: an execution range of the operation; an access mode of the operation; a current execution state of the operation; and a predetermined division of the shape coordinate space of the tensor data.
  • Clause 8 The method of any of clauses 5-7, wherein: determining the first coordinate space range includes determining a coordinate space upper bound on one or more dimensions of the tensor data that the first operation is allowed to use; and/or determining the second coordinate space range includes determining a coordinate space lower bound on one or more dimensions of the tensor data that the first operation is expected to use.
  • Clause 9 The method of any of clauses 6-8, wherein the predetermined division of the shape coordinate space of the tensor data is performed based on at least one of the following: a processing capability of hardware; preset parameters; and a size of the shape coordinate space of the tensor data.
  • Clause 10 The method of any of clauses 1-9, wherein at least one of the first operation and the second operation is a write operation.
  • Clause 11 The method of any of clauses 1-10, wherein: the first operation and the second operation are respectively operations in different instructions executed in parallel; or
  • the first operation and the second operation are respectively different operations performed in parallel in the same instruction.
  • Clause 12 A processing device, comprising:
  • an operation acquisition unit configured to acquire a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points in the shape coordinate space;
  • a first determination unit configured to determine whether there is an ongoing second operation on the tensor data;
  • a second determination unit configured to determine, when the second operation exists, whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and
  • an execution unit configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.
  • Clause 13 The processing device of Clause 12, further comprising: a blocking unit configured to block the first operation when the first fine-grained region overlaps with the second fine-grained region.
  • Clause 14 The processing device of any of clauses 12-13, further comprising: a third determination unit configured to determine whether the data operation range of the first operation overlaps with the data operation range of the second operation; wherein
  • the second determination unit is configured to, when the third determination unit determines that the data operation ranges overlap, determine whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
  • the execution unit is configured to execute the first operation when the third determination unit determines that the data operation ranges do not overlap.
  • Clause 15 The processing device of Clause 14, wherein the third determination unit determines whether the data manipulation range of the first operation overlaps the data manipulation range of the second operation based on at least one of the following:
  • spatial information of the tensor data to be operated on; and shape information of the tensor data to be operated on.
  • Clause 16 The processing device of any of clauses 12-15, wherein the second determination unit comprises: a first determination subunit configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use;
  • a second determination subunit configured to determine a second coordinate space range of the tensor data to be used when performing the first operation
  • the execution unit is further configured to execute the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range,
  • the first coordinate space range, the second coordinate space range and the third coordinate space range are characterized by the fine-grained region.
  • Clause 17 The processing device of Clause 16, wherein the first coordinate space range is determined based on at least one of the following: a sequence of operations; operands involved in the operations; a second coordinate space range of a previous operation; and a predetermined division of the shape coordinate space of the tensor data.
  • Clause 18 The processing device of any of clauses 16-17, wherein the second coordinate space extent is determined based on at least one of:
  • an execution range of the operation; an access mode of the operation; a current execution state of the operation; and a predetermined division of the shape coordinate space of the tensor data.
  • Clause 19 The processing device of any of clauses 16-18, wherein: the first determination subunit is further configured to determine a coordinate space upper bound on one or more dimensions of the tensor data that the first operation is allowed to use; and/or
  • the second determination subunit is further configured to: determine a coordinate space lower bound of one or more dimensions of the tensor data that is expected to be used by the first operation.
  • Clause 20 The processing device of any of clauses 17-19, wherein the predetermined division of the shape coordinate space of the tensor data is performed based on at least one of the following: a processing capability of hardware; preset parameters; and a size of the shape coordinate space of the tensor data.
  • Clause 21 The processing device of any of clauses 12-20, wherein at least one of the first operation and the second operation is a write operation.
  • Clause 22 The processing device of any of clauses 12-21, wherein: the first operation and the second operation are respectively operations in different instructions executed in parallel; or
  • the first operation and the second operation are respectively different operations performed in parallel in the same instruction.
  • Clause 23 A chip comprising the processing device of any of clauses 12-22.

Abstract

A processing method, a processing apparatus, and a related product. The processing apparatus can be implemented as a computing apparatus (510) included in a combined processing apparatus (500); the combined processing apparatus (500) can further comprise an interface apparatus (504) and other processing apparatuses (506); the computing apparatus (510) interacts with the other processing apparatuses (506) so as to jointly complete a computing operation specified by a user; the combined processing apparatus (500) can further comprise a storage apparatus (508); and the storage apparatus (508) is separately connected to the computing apparatus (510) and the other processing apparatuses (506) and is used for storing the data of the computing apparatus (510) and the other processing apparatuses (506). The method provides an instruction parallelism solution that can improve instruction parallelism, thereby improving the processing efficiency of a machine.

Description

Processing method, processing device and related products

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application filed on November 13, 2020 with application number 2020112703785 and entitled "Processing Method, Processing Device and Related Products".

Technical Field

The present disclosure relates to the field of processors, and in particular to a processing method, a processing device, a chip and a board.

Background

The instruction system is the interface for interaction between computer software and hardware, and is a very important part of the computer system architecture. With the continuous development of artificial intelligence technology, the amount and the dimensionality of the data that need to be processed keep increasing. Therefore, how to control the execution of instructions reasonably and scientifically, and in particular how to increase the degree of instruction parallelism and improve the performance of the machine, is an important issue in the field of processors.

Summary of the Invention

In order to solve one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, solutions for enhancing instruction parallelism. Through the instruction system of the present disclosure, the degree of instruction parallelism can be increased, thereby improving the processing efficiency of the machine.
In a first aspect, the present disclosure provides a processing method, the method comprising: acquiring a first operation of an instruction, the first operation being an operation on tensor data, wherein the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points of the shape coordinate space; determining whether there is an ongoing second operation on the tensor data; when the second operation exists, determining whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and performing the first operation when the first fine-grained region does not overlap with the second fine-grained region.

In a second aspect, the present disclosure provides a processing device, comprising: an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data, wherein the shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points of the shape coordinate space; a first determination unit configured to determine whether there is an ongoing second operation on the tensor data; a second determination unit configured to determine, when the second operation exists, whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and an execution unit configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.

In a third aspect, the present disclosure provides a chip including the processing device of any embodiment of the foregoing second aspect.

In a fourth aspect, the present disclosure provides a board including the chip of any embodiment of the foregoing third aspect.

With the processing device, processing method, chip and board provided above, the embodiments of the present disclosure restrict the parallelism of operations, during the execution of the operations of instructions, based on the fine-grained regions of the shape coordinate space of the tensor data targeted by the operations, so that the potential for parallel execution of the operations can be exploited. Therefore, according to the embodiments of the present disclosure, during parallel execution by hardware, the consistency of the execution order can be guaranteed while the degree of parallelism of the operations is increased, thereby ensuring both the accuracy and the efficiency of processing.
Description of the Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, in which:

FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;

FIG. 1B shows a schematic diagram of data blocks in a data storage space according to an embodiment of the present disclosure;

FIG. 2 shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;

FIGS. 3A-3C show schematic flowcharts of a processing method according to an embodiment of the present disclosure;

FIG. 3D shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;

FIG. 4 shows a schematic diagram of coordinate space ranges according to an embodiment of the present disclosure;

FIG. 5 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure; and

FIG. 6 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.
Detailed Description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", "third" and "fourth", etc., that may be used in the claims, the description and the drawings of the present disclosure are used to distinguish different objects, rather than to describe a specific order. The terms "including" and "comprising" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.

It should also be understood that the terminology used in this description of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and the claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and the claims of the present disclosure refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
Computers process various kinds of data by executing instructions. In order to indicate the source of the data, the destination of the operation result and the operation to be performed, an instruction usually contains the following information:

(1) An operation code (OP), which is used to indicate the operation to be completed by the instruction (for example, addition, subtraction, multiplication, division, data transfer, etc.) and specifies the nature and function of the operation. A computer may have dozens to hundreds of instructions, each instruction has a corresponding operation code, and the computer completes different operations by identifying the operation codes.

(2) Operands, which are used to describe the operation objects of the instruction. An operand may relate to the data type, memory access address, addressing mode, etc. of the object being operated on. An operand may directly give the object being operated on, or indicate the memory address or register address (i.e., register name) of the object being operated on.

The instructions of traditional processors are designed to perform basic single-data scalar operations. Here, a single-data scalar operation means that every operand of the instruction is scalar data. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the operands involved are often of multi-dimensional vector (i.e., tensor data) data types, and using scalar operations alone cannot enable the hardware to complete the computing tasks efficiently. Therefore, how to perform multi-dimensional tensor data processing efficiently is also an urgent problem to be solved in the current computing field.

In embodiments of the present disclosure, an instruction system is provided in which a descriptor is included in an operand of an instruction, and information related to tensor data can be obtained through the descriptor. Specifically, the descriptor may indicate at least one of the following: shape information of the tensor data and spatial information of the tensor data. The shape information of the tensor data can be used to determine the data address, in the data storage space, of the tensor data corresponding to the operand. The spatial information of the tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the execution order of the instructions. The spatial information of the tensor data can be indicated by a space identifier (ID). A space ID may also be called a space alias, and it refers to a space region used to store the corresponding tensor data; the space region may be a contiguous space or several segments of space, and the present disclosure does not limit the specific composition of the space region. Different space IDs indicate that there is no dependency between the space regions they point to. For example, the absence of dependencies can be ensured by making the space regions pointed to by different space IDs not overlap each other.
Various possible implementations of the shape information of tensor data will be described in detail below with reference to the accompanying drawings.

A tensor can contain data organized in many forms. Tensors can be of different dimensions: for example, a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a matrix can be a tensor of 2 or more dimensions. The shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for the three-dimensional tensor:

x3 = [[[1, 2, 3], [4, 5, 6]]; [[7, 8, 9], [10, 11, 12]]]

the shape or dimensions of this tensor can be expressed as X3 = (2, 2, 3), that is, three parameters indicate that the tensor is three-dimensional, with the size of its first dimension being 2, the size of its second dimension being 2, and the size of its third dimension being 3. When tensor data is stored in a memory, the shape of the tensor data cannot be determined from its data address (or storage region), and related information such as the interrelationship between multiple pieces of tensor data cannot be determined either, resulting in low efficiency when the processor accesses the tensor data.

In one possible implementation, a descriptor can be used to indicate the shape of N-dimensional tensor data, where N is a positive integer, for example N = 1, 2 or 3, or zero. The three-dimensional tensor in the above example can be represented by the descriptor as (2, 2, 3). It should be noted that the present disclosure does not limit the manner in which a descriptor indicates the shape of a tensor.

In one possible implementation, the value of N may be determined according to the number of dimensions (also called the order) of the tensor data, or may be set according to the usage needs of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (such as offsets, sizes, etc.) of the three-dimensional tensor data in the three dimension directions. It should be understood that those skilled in the art can set the value of N according to actual needs, and the present disclosure does not limit this.

Although tensor data can be multi-dimensional, the layout of a memory is always one-dimensional, so there is a correspondence between a tensor and its storage in the memory. Tensor data is usually allocated in a contiguous storage space, that is, the tensor data can be unfolded into one dimension (for example, in row-major order) and stored in the memory.

This relationship between a tensor and the underlying storage can be represented by the offset of each dimension, the size of each dimension, the stride of each dimension, and so on. The offset of a dimension refers to the offset relative to a reference position in that dimension. The size of a dimension refers to the extent of that dimension, that is, the number of elements in that dimension. The stride of a dimension refers to the interval between adjacent elements in that dimension; for example, the strides of the three-dimensional tensor above are (6, 3, 1), that is, the stride of the first dimension is 6, the stride of the second dimension is 3, and the stride of the third dimension is 1.
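As an illustrative sketch, for the three-dimensional tensor of shape (2, 2, 3) discussed above, stored contiguously in row-major order, the strides are (6, 3, 1), and the linear storage offset of an element can be obtained by combining its multi-dimensional index with the strides:

      # Sketch: compute the linear (flattened) offset of element (i, j, k) of a tensor stored in
      # row-major order, given the per-dimension strides; for shape (2, 2, 3) the strides are (6, 3, 1).
      shape = (2, 2, 3)
      strides = (6, 3, 1)

      def linear_offset(index, strides):
          return sum(i * s for i, s in zip(index, strides))

      assert linear_offset((1, 0, 2), strides) == 8   # 1*6 + 0*3 + 2*1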
FIG. 1A is a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1A, the data storage space 21 stores two-dimensional data in row-major order, which can be indexed by (x, y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row, i.e., the total number of columns) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the starting address PA_start (the base address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when a descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the base address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined from the size ori_x of the data storage space 21 on the X axis, its size ori_y on the Y axis, the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
In a possible implementation, the content of the descriptor can be represented by the following formula (1):
{ori_x, ori_y, offset_x, offset_y, size_x, size_y}  (1)
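As an illustration only, a record holding formula (1)-style content might look as follows; the class and field names are assumptions for this sketch, not a disclosed format:

    from dataclasses import dataclass

    @dataclass
    class Descriptor2D:
        # Formula (1)-style content for a 2-D block inside a row-major space.
        ori_x: int      # total number of columns of the storage space
        ori_y: int      # total number of rows of the storage space
        offset_x: int   # X offset of the block
        offset_y: int   # Y offset of the block
        size_x: int     # X size of the block
        size_y: int     # Y size of the block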
It should be understood that although in the above example the content of the descriptor represents a two-dimensional space, those skilled in the art can set the specific number of dimensions represented by the content of the descriptor according to the actual situation, and the present disclosure does not limit this.
In a possible implementation, a base address of the data reference point of the descriptor within the data storage space may be agreed upon, and, on the basis of this base address, the content of the descriptor of the tensor data may be determined from the positions, relative to the data reference point, of at least two vertices located at diagonal positions in the N dimension directions.
For example, the base address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For instance, one datum in the data storage space 21 (for example, the datum at position (2, 2)) may be selected as the data reference point, and the physical address of that datum in the data storage space may be used as the base address PA_base. The content of the descriptor of the data block 23 in FIG. 1A can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example the diagonal vertices from the upper-left to the lower-right corner, where the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the base address PA_base, the relative position (x_min, y_min) of the upper-left vertex, and the relative position (x_max, y_max) of the lower-right vertex.
In a possible implementation, the content of the descriptor (with base address PA_base) can be represented by the following formula (2):
{(x_min, y_min), (x_max, y_max)}  (2)
It should be understood that although in the above example the vertices at the upper-left and lower-right corners are used to determine the content of the descriptor, those skilled in the art can choose which of the at least two diagonal vertices to use according to actual needs, and the present disclosure does not limit this.
In a possible implementation, the content of the descriptor of the tensor data may be determined from the base address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description positions and the data addresses of the tensor data indicated by the descriptor. The mapping relationship between data description positions and data addresses can be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional spatial data, a function f(x, y, z) can be used to define the mapping from data description positions to data addresses.
In a possible implementation, the content of the descriptor can be represented by the following formula (3):
{f(x, y, z)}  (3)
In a possible implementation, the descriptor is further used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor further includes at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be given by the following formula (4):
{PA, ori_x, ori_y, offset_x, offset_y, size_x, size_y}  (4)
Here PA is an address parameter. The address parameter may be a logical address or a physical address. When the descriptor is parsed, PA may be taken as any one of a vertex, a midpoint, or a preset point of the vector shape, and the corresponding data address is obtained by combining it with the shape parameters in the X and Y directions.
In a possible implementation, the address parameter of the tensor data includes the base address of the data reference point of the descriptor in the data storage space of the tensor data, and the base address includes the starting address of the data storage space.
In a possible implementation, the descriptor may further include at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be given by the following formula (5):
{PA_start, ori_x, ori_y, offset_x, offset_y, size_x, size_y}  (5)
Here PA_start is the base address parameter, which is not described again.
It should be understood that those skilled in the art can set the mapping relationship between data description positions and data addresses according to the actual situation, and the present disclosure does not limit this.
In a possible implementation, an agreed base address may be set for a task, the descriptors in all instructions under this task use this base address, and the descriptor content may include shape parameters based on this base address. This base address may be determined by setting an environment parameter of the task. For the description and usage of the base address, reference may be made to the foregoing embodiments. In this implementation, the content of a descriptor can be mapped to a data address more quickly.
In a possible implementation, the base address may be included in the content of each descriptor, in which case the base address of each descriptor may differ. Compared with setting a common base address through an environment parameter, this approach lets each descriptor describe data more flexibly and use a larger data address space.
In a possible implementation, the data address, in the data storage space, of the data corresponding to an operand of a processing instruction may be determined according to the content of the descriptor. The computation of the data address is completed automatically by hardware, and the computation method differs depending on how the content of the descriptor is represented. The present disclosure does not limit the specific method of computing the data address.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, and its size is size_x*size_y, then the starting data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined by the following formula (6):
PA1(x,y) = PA_start + (offset_y - 1)*ori_x + offset_x  (6)
From the data starting address PA1(x,y) determined by formula (6), combined with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage region, the storage region in the data storage space of the tensor data indicated by the descriptor can be determined.
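The address computation of formula (6) can be sketched as follows; this is illustrative only, assumes element-granular addresses, and mirrors the 1-based row offset of the formula as written:

    def start_address(pa_start, ori_x, offset_x, offset_y):
        # Formula (6): starting address of the described block inside the
        # row-major storage space, counted in elements.
        return pa_start + (offset_y - 1) * ori_x + offset_x

    print(start_address(0, 16, 4, 3))  # -> 36 for a 16-column space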
In a possible implementation, when the operand further includes a data description position for the descriptor, the data address, in the data storage space, of the data corresponding to the operand can be determined according to the content of the descriptor and the data description position. In this way, part of the data (for example, one or more data elements) within the tensor data indicated by the descriptor can be processed.
For example, if the content of the descriptor in the operand is expressed by formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, its size is size_x*size_y, and the data description position for the descriptor included in the operand is (x_q, y_q), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined by the following formula (7):
PA2(x,y) = PA_start + (offset_y + y_q - 1)*ori_x + (offset_x + x_q)  (7)
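Similarly, formula (7) can be sketched as below (illustrative only; x_q and y_q are the data description position within the block, and element-granular addressing is assumed):

    def element_address(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
        # Formula (7): address of the element at data description position
        # (x_q, y_q) inside the described block, counted in elements.
        return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

    print(element_address(0, 16, 4, 3, 1, 2))  # -> 69 for a 16-column space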
In a possible implementation, the descriptor may indicate tiled (blocked) data. Partitioning data into blocks can effectively speed up computation and improve processing efficiency in many applications. For example, in graphics processing, convolution operations often operate on data blocks for fast processing.
FIG. 1B is a schematic diagram of data blocks in a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1B, the data storage space 26 likewise stores two-dimensional data in row-major order, which can be indexed by (x, y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row, i.e., the total number of columns) is ori_x (not shown in the figure), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure). Unlike the tensor data of FIG. 1A, the tensor data stored in FIG. 1B includes multiple data blocks.
In this case, the descriptor requires more parameters to represent these data blocks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori_x; x.tile.size (the size 27 within a block); x.tile.stride (the stride 28 within a block, i.e., the distance between the first point of the first block and the first point of the second block); x.tile.num (the number of blocks, shown as 3 blocks in FIG. 1B); x.stride (the overall stride, i.e., the distance from the first point of the first row to the first point of the second row); and so on. The other dimensions may similarly include corresponding parameters.
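As a hedged illustration of how the tiling parameters compose (the helper is hypothetical and only covers the X dimension):

    def x_coordinate(tile_idx, x_in_tile, x_tile_stride):
        # Hypothetical helper: the X coordinate of a point given its tile index
        # and its position inside the tile, using x.tile.stride as the distance
        # between the first points of consecutive tiles.
        return tile_idx * x_tile_stride + x_in_tile

    print(x_coordinate(2, 1, 5))  # third tile, second column inside it -> 11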
In a possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish descriptors; for example, the identifier may be a number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is three-dimensional and the shape parameters of two of its three dimensions are fixed, the content of the descriptor may include a shape parameter representing the remaining dimension of the tensor data.
In a possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be allocated for the tensor data, so that the starting address of each piece of tensor data in the data storage space corresponds one-to-one with a descriptor. In this case, the circuit or module responsible for parsing the computation instruction (for example, an entity outside the computing device of the present disclosure) can determine, from the descriptor, the data address in the data storage space of the data corresponding to the operand.
In a possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may also be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter representing the address of the tensor data. For example, if the tensor data is three-dimensional and the descriptor points to the address of the tensor data, the content of the descriptor may include one address parameter representing the address of the tensor data, such as its starting physical address, or multiple address parameters of the address of the tensor data, such as the starting address plus an address offset, or per-dimension address parameters of the tensor data. Those skilled in the art can set the address parameters according to actual needs, and the present disclosure does not limit this.
In a possible implementation, the address parameter of the tensor data may include the base address of the data reference point of the descriptor in the data storage space of the tensor data. The base address differs as the data reference point changes. The present disclosure does not limit the selection of the data reference point.
In a possible implementation, the base address may include the starting address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the base address of the descriptor is the starting address of the data storage space. When the data reference point of the descriptor is some data other than the first data block in the data storage space, the base address of the descriptor is the address of that data block in the data storage space.
In a possible implementation, the shape parameters of the tensor data include at least one of the following: the size of the data storage space in at least one of the N dimension directions; the size of the storage region in at least one of the N dimension directions; the offset of the storage region in at least one of the N dimension directions; the positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions; and the mapping relationship between the data description positions and the data addresses of the tensor data indicated by the descriptor. A data description position is the mapped position of a point or region in the tensor data indicated by the descriptor; for example, when the tensor data is three-dimensional, the descriptor can use three-dimensional spatial coordinates (x, y, z) to represent the shape of the tensor data, and a data description position of the tensor data may be the position, expressed in three-dimensional spatial coordinates (x, y, z), of a point or region of the tensor data mapped into the three-dimensional space.
It should be understood that those skilled in the art can choose the shape parameters used to represent the tensor data according to the actual situation, and the present disclosure does not limit this. By using descriptors during data access, associations between data can be established, thereby reducing the complexity of data access and improving instruction processing efficiency.
FIG. 2 is a schematic block diagram of a processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2, the processing apparatus 200 includes a control module 210, an operation module 220, and a storage module 230.
The control module 210 may be configured to control the operation of the processing apparatus 200, for example reading instructions from memory or from an external source, decoding the instructions, and issuing micro-operation control signals to the corresponding components. Specifically, the control module 210 may be configured to control the operation module 220 to perform corresponding processing according to a received instruction. Instructions may include, but are not limited to, data access instructions, operation instructions, descriptor management instructions, and synchronization instructions. The present disclosure does not limit the specific type of instruction or the specific manner of decoding.
A decoded instruction includes an opcode and operands. When an instruction involves processing of tensor data, at least one operand of the instruction may include at least one descriptor, and the descriptor indicates at least one of the following: shape information of the tensor data and space information of the tensor data.
The operation module 220 is configured to execute specific instructions or operations under the control of the control module 210. The operation module 220 may include, for example but not limited to, an arithmetic and logic unit (ALU), a memory access unit (MAU), a neural functional unit (NFU), and the like. The present disclosure does not limit the specific hardware type of the execution unit.
The storage module 230 may be configured to store various information, including but not limited to instructions, information associated with descriptors, tensor data, and the like. The storage module 230 may include various storage resources, including but not limited to internal memory and external memory. The internal memory may include, for example, registers, on-chip SRAM, or other media caches. The external memory may include, for example, off-chip memory. The present disclosure does not limit the specific implementation of the storage module.
Optionally or additionally, the processing apparatus 200 may further include a tensor interface unit (TIU) 240. The tensor interface unit 240 may be configured to implement operations associated with descriptors under the control of the control module 210. These operations may include, but are not limited to, registration, modification, deregistration, and parsing of descriptors, as well as reading and writing of descriptor content. The present disclosure does not limit the specific hardware type of the tensor interface unit. In this way, operations associated with descriptors can be implemented by dedicated hardware, further improving the efficiency of accessing tensor data.
In some embodiments of the present disclosure, the tensor interface unit 240 may be configured to parse the descriptors included in the operands of an instruction. For example, the tensor interface unit may parse the shape information of the tensor data included in a descriptor to determine the data address, in the data storage space, of the data corresponding to the operand.
Although the control module 210 and the tensor interface unit 240 are shown as two separate modules in FIG. 2, those skilled in the art will understand that these two modules/units may also be implemented as one module or as more modules, and the present disclosure is not limited in this respect.
The data processing apparatus 200 may be implemented with a general-purpose processor (for example, a central processing unit (CPU) or a graphics processing unit (GPU)) and/or a special-purpose processor (for example, an artificial intelligence processor, a scientific computing processor, or a digital signal processor), and the present disclosure does not limit the specific type of the data processing apparatus.
When hardware executes instructions in parallel, dependencies between the instructions executed in parallel may lead to incorrect execution results. For example, if two instructions executed in parallel access the same storage unit and at least one of them writes to that storage unit, there is a dependency between the two instructions, such as a read-after-write dependency, a write-after-write dependency, or a write-after-read dependency. In this case, if the later instruction is executed before the earlier one, an execution error results. Therefore, sequential consistency of the execution of these instructions must be guaranteed, for example by forcing sequential execution, that is, the later instruction must wait for the earlier instruction to complete before it can execute.
As can be seen from the foregoing description of tensor data, tensor data is usually a multi-dimensional array with a large amount of data, so the time to process an instruction on tensor data is usually longer than on scalar data. If tensor data were still processed in the previous strictly sequential manner, the processing time would be too long and the efficiency low. In view of this, embodiments of the present disclosure provide an operation-level instruction parallelism scheme, in which the parallelism of operations is constrained on the basis of fine-grained regions of the shape coordinate space of the tensor data targeted by the operations, so that the potential for parallel execution of operations can be exploited. Therefore, according to the embodiments of the present disclosure, when hardware executes in parallel, the consistency of the execution order can be guaranteed while the degree of parallelism of operations is increased, thereby ensuring both the accuracy and the efficiency of processing.
FIG. 3A shows an exemplary flowchart of a processing method 300 according to an embodiment of the present disclosure. The processing method 300 may be implemented, for example, by the processing apparatus 200 of FIG. 2.
As shown in FIG. 3A, the method 300 starts at step S301, in which a first operation of an instruction is obtained. This step may be performed, for example, by the control module 210 of FIG. 2. In some embodiments, the first operation is an operation on tensor data, and the shape coordinate space of the tensor data includes at least one fine-grained region. In some embodiments, a fine-grained region may include one or more adjacent coordinate points of the shape coordinate space of the tensor data. A fine-grained region is the smallest unit of an operation.
It should be noted that the operations involved in the present disclosure may be basic operations supported by the processor hardware, or microinstructions (for example, request signals) obtained by parsing such basic operations. The present disclosure does not limit the specific type of operation. The processing apparatus of the present disclosure may execute two operations in parallel, or more than two operations in parallel, and the present disclosure does not limit the number of operations executed in parallel. Two operations executed in parallel may belong to the same instruction or to different instructions, and the present disclosure is not limited in this respect.
When hardware executes instructions in parallel, the processor can execute multiple operations in parallel. To avoid memory access conflicts, when the multiple operations executed in parallel by the processor all target the same data, the processor executes only one of them and blocks the others, which reduces the processor's efficiency. In the embodiments of the present disclosure, the shape coordinate space of the processed tensor data is further divided into multiple fine-grained regions, and whether operations can be executed in parallel is judged on the basis of these fine-grained regions, which can greatly improve the efficiency of the processor.
In some embodiments, the shape, size and/or number of the fine-grained regions may be determined at least in part based on at least one of: the computing capability of the hardware; the bandwidth of the hardware; and the size of the shape coordinate space of the tensor data. The hardware computing capability may be the amount of data the hardware processes in parallel within one computation cycle, and the hardware bandwidth may be the data transfer capability, for example the amount of data transferred per unit time.
For example, for a processor applying the processing method of the embodiments of the present disclosure whose hardware computing capability is to process 100 bits of data in parallel in one computation cycle and whose hardware bandwidth is to transfer 200 bits of data per unit time, for two-dimensional tensor data of size 100*100 bits the shape coordinate space of the tensor data may be divided into 100 fine-grained regions according to the hardware computing capability, each fine-grained region containing 100 bits of data; or the shape coordinate space may be divided into 50 fine-grained regions according to the hardware bandwidth, each fine-grained region containing 200 bits of data.
It should be understood that the hardware computing capability and the hardware bandwidth may differ from one processor to another, and the present disclosure does not limit them. In this way, the size and/or number of fine-grained regions can be determined according to the processing capability of the processor (hardware computing capability and/or hardware bandwidth), so that the fine-grained partitioning better matches the requirements of different hardware environments and the operations performed on fine-grained regions tend to keep pace with the processing capability of the processor, allowing the hardware to run at its full execution efficiency and thereby improving the processing efficiency of the processor.
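A minimal sketch of such a capability-driven partitioning, using the 100*100-bit example above (the rounding policy is an assumption for illustration, not the disclosed implementation):

    def num_fine_grained_regions(total_bits, bits_per_region):
        # One region per chunk the hardware can handle at a time (round up).
        return (total_bits + bits_per_region - 1) // bits_per_region

    total = 100 * 100                               # 100*100-bit two-dimensional tensor
    print(num_fine_grained_regions(total, 100))     # by compute capability -> 100
    print(num_fine_grained_regions(total, 200))     # by bandwidth          -> 50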
It should be noted that the shapes and sizes of the multiple fine-grained regions may be the same or different. For example, the first operation may carry the shape and size of a first granularity (the coordinate-point positions and count of each fine-grained region) and set this first granularity to a square of 8*8 = 64 coordinate points (assuming a two-dimensional tensor), while the second operation may carry the shape and size of a second granularity (for example, the coordinate-point positions and count of each fine-grained region) and set this second granularity to a square of 16*16 = 256 coordinate points. That is, when the first operation is executed, each square of 8*8 = 64 coordinate points is treated as one fine-grained region, and when the second operation is executed, each square of 16*16 = 256 coordinate points is treated as one fine-grained region. Similarly, the first operation may carry a first number of fine-grained regions (for example, set to 4), and the second operation may carry a second number of fine-grained regions (for example, set to 8). That is, when the first operation is executed, the shape coordinate space is divided into 4 fine-grained regions, and when the second operation is executed, the shape coordinate space is divided into 8 fine-grained regions. It can be understood that an operation may also carry the fine-grained shape, size, and number parameters at the same time. The shape, size, and/or number of each fine-grained region may be determined as required, and the present disclosure does not limit this.
Continuing with FIG. 3A, in step S302 it is determined whether there is an ongoing second operation on the tensor data.
As mentioned above, when an instruction involves processing of tensor data, the operand includes a descriptor through which information related to the tensor data can be obtained. Therefore, in some embodiments, the descriptor may include space information of the tensor data (for example, a space identifier, or space ID), and the dependency between instructions can be determined from the space information of the tensor data. Different space IDs indicate that there is no dependency between the space regions they point to. Therefore, whether two instructions have a dependency, that is, whether they operate on the same tensor data, can be quickly judged by whether the space IDs of the tensor data they process are the same.
In other embodiments, whether there is an ongoing second operation on the tensor data may be determined from the occupancy state of the data storage region corresponding to the tensor data. For example, the processor may query an occupancy state list to determine whether the data storage region of the tensor data is occupied; if it is occupied, the conclusion is that there is an ongoing second operation on the tensor data. The occupancy state list may be preset and stored in memory, or may be generated before the processor starts executing a task and released after the task is completed. When the occupancy state of a data storage region changes, the processor updates the content of the occupancy state list to record the occupancy state of each data storage region. The present disclosure does not limit the way of judging whether there is an ongoing second operation on one or more pieces of tensor data.
Next, in step S303, when such a second operation exists, it is determined whether the first fine-grained region currently targeted by the first operation overlaps the second fine-grained region currently targeted by the second operation.
The first fine-grained region and the second fine-grained region may each be any fine-grained region among the multiple fine-grained regions in the shape coordinate space of the tensor data. It can be understood that an operation on tensor data is an operation on the fine-grained regions in the shape coordinate space of the tensor data. For example, suppose tensor data A is a two-dimensional matrix with 8 rows and 16 columns, its shape coordinate space is a two-dimensional space, and every 2 rows by 4 columns form a fine-grained region, so the shape coordinate space of the tensor data includes 16 fine-grained regions. A write operation on tensor data A can then be regarded as write operations on these 16 fine-grained regions. The execution may proceed as follows: write the 1st fine-grained region (rows 1-2, columns 1-4); after the 1st fine-grained region is written, write the 2nd fine-grained region (rows 1-2, columns 5-8); after the 2nd fine-grained region is written, write the 3rd fine-grained region (rows 1-2, columns 9-12); and so on, until the 16th fine-grained region (rows 7-8, columns 13-16) is written and the write operation on tensor data A is complete. Those skilled in the art will understand that multiple fine-grained regions may also be operated on at a time, for example two or more fine-grained regions may be written at a time, until the operation on all the regions is complete.
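The region layout of this 8x16 example can be sketched as follows (illustrative only; the enumeration order matches the write order described above, and zero-based half-open ranges are an assumption of the sketch):

    ROWS, COLS = 8, 16          # tensor data A from the example
    FG_ROWS, FG_COLS = 2, 4     # one fine-grained region = 2 rows x 4 columns

    def fine_grained_regions():
        # Enumerate regions left to right, top to bottom.
        for r in range(0, ROWS, FG_ROWS):
            for c in range(0, COLS, FG_COLS):
                yield (r, r + FG_ROWS, c, c + FG_COLS)   # half-open row/column ranges

    regions = list(fine_grained_regions())
    assert len(regions) == 16
    print(regions[0])    # (0, 2, 0, 4)   -> rows 1-2, columns 1-4
    print(regions[15])   # (6, 8, 12, 16) -> rows 7-8, columns 13-16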
When there is an operation on tensor data, as the operation proceeds, the state of a fine-grained region in the shape coordinate space of the tensor data may be one of: already operated on (completed), currently being operated on, and not yet operated on; or, when it is not necessary to record whether an operation has been completed, the state may be occupied or available. The state of the fine-grained region currently targeted by an operation is "being operated on" or "occupied". Thus, when there is an operation on tensor data, it can be regarded as an operation on one fine-grained region in the shape coordinate space of the tensor data, and the fine-grained region that is being operated on or is occupied is the fine-grained region currently targeted by the operation.
In a possible implementation, the first fine-grained region currently targeted by the first operation may include the fine-grained region that the first operation is about to operate on; for example, when the operation is just starting and execution is specified to follow a predetermined order, this is usually the first fine-grained region. It may also include the fine-grained region currently targeted by the first operation while it is executing, which may be any fine-grained region. The second fine-grained region currently targeted by the second operation may be the fine-grained region currently targeted by the second operation while it is executing, which may be any fine-grained region.
In a possible implementation, when, before the first operation operates on the tensor data, it is judged whether there is an ongoing second operation on the tensor data, the first fine-grained region currently targeted by the first operation is the fine-grained region on which the first operation is about to operate. For example, before the first operation operates on the tensor data, the first fine-grained region currently targeted by the first operation is usually the first fine-grained region of the shape coordinate space of the tensor data; at this point, the first operation has not yet operated on the first fine-grained region. The second fine-grained region currently targeted by the ongoing second operation may depend on the progress of the second operation. If the second operation has also just started, the second fine-grained region may also be the first fine-grained region of the shape coordinate space of the tensor data; in this case, the first fine-grained region and the second fine-grained region overlap. If the second operation has already completed its operation on the first fine-grained region and the second fine-grained region currently targeted is the P-th fine-grained region (P being an integer greater than 1), the first fine-grained region and the second fine-grained region do not overlap.
In a possible implementation, when, during the first operation's operation on the tensor data, it is judged whether there is an ongoing second operation on the tensor data, the first fine-grained region may be determined from the progress of the first operation and the second fine-grained region from the progress of the second operation, and it is then judged whether the first fine-grained region and the second fine-grained region overlap.
In a possible implementation, if the operations proceed at a consistent pace, it is sufficient to judge, only before the first operation operates on the tensor data, whether there is an ongoing second operation on the tensor data and whether the first fine-grained region and the second fine-grained region overlap. Here, a consistent pace means that, for fine-grained regions of the same size, the two operations take the same time to operate on one fine-grained region.
In a possible implementation, if the operations do not proceed at a consistent pace, or it cannot be determined whether they do, then during the first operation's operation on the tensor data, each time the operation on the currently targeted first fine-grained region is completed, it is judged again whether there is an ongoing second operation on the tensor data and whether the first fine-grained region and the second fine-grained region overlap, so as to determine whether the first operation can continue.
In a possible implementation, whether the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation overlap may be judged from coordinate addresses, pointer positions, fine-grained region identifiers, and the like. For example, the coordinate address currently targeted by each operation on the tensor data may be recorded; from the current coordinate address of the first operation, the current coordinate address of the second operation, and the correspondence between coordinate addresses and fine-grained regions, the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation are determined respectively, and it is then judged whether they overlap. Both coordinate addresses and fine-grained regions are defined on the shape coordinate space of the tensor data, so once the fine-grained partitioning of the shape coordinate space is known, the corresponding fine-grained region can be determined directly from a coordinate address. As another example, a pointer may be set for each operation, pointing to the fine-grained region currently targeted by the operation; from the pointer position of the first operation and the pointer position of the second operation, the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation are determined respectively, and it is then judged whether they overlap. As yet another example, an identifier may be set for each fine-grained region, and whether the first fine-grained region and the second fine-grained region overlap is judged by recording the identifier of the fine-grained region currently targeted by each operation; an identifier may be any combination of letters, digits, or symbols. Other ways of judging whether the first fine-grained region and the second fine-grained region overlap may also be used, and the present disclosure does not limit the basis for this judgment.
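As an illustrative sketch of the identifier-based check (the region geometry reuses the 2x4 regions of the earlier 8x16 example; all names are assumptions of the sketch):

    def region_id(coord, fg_rows=2, fg_cols=4, cols=16):
        # Map a coordinate point (row, col) of the shape coordinate space to the
        # identifier of the fine-grained region containing it (row-major numbering).
        r, c = coord
        regions_per_row = cols // fg_cols
        return (r // fg_rows) * regions_per_row + (c // fg_cols)

    def may_execute(first_coord, second_coord):
        # The first operation may proceed only if the two operations currently
        # target different fine-grained regions.
        return region_id(first_coord) != region_id(second_coord)

    print(may_execute((0, 0), (0, 5)))   # different regions -> True
    print(may_execute((0, 0), (1, 3)))   # same region       -> False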
Next, in step S304, when the first fine-grained region and the second fine-grained region do not overlap, the first operation is executed.
In a possible implementation, if the first fine-grained region currently targeted by the first operation does not overlap the second fine-grained region currently targeted by the second operation, the first fine-grained region may be a fine-grained region on which the second operation has already completed its operation, or a fine-grained region on which the second operation does not need to operate. In this case, executing the first operation does not affect the process or the result of the second operation, and the first operation can be executed.
According to this embodiment, when the shape coordinate space of the tensor data targeted by the first operation includes at least one fine-grained region and there is an ongoing second operation on the tensor data, it is judged whether the first fine-grained region currently targeted by the first operation overlaps the second fine-grained region currently targeted by the second operation, and when the two do not overlap, the first operation is executed. In this way, the first operation can be executed as long as the fine-grained regions currently operated on by the first and second operations do not overlap, so that the first operation and the second operation can operate on the same tensor data at the same time, which improves the processing efficiency of the processor.
In a possible implementation, the processing method 300 may further include: blocking the first operation when the first fine-grained region overlaps the second fine-grained region.
In a possible implementation, overlap between the first fine-grained region and the second fine-grained region includes the first fine-grained region completely or partially overlapping the second fine-grained region. When the first fine-grained region overlaps the second fine-grained region, if the first operation is executed, its operation on the overlapping portion may affect the execution of the second operation and make the result of the second operation inaccurate, or may affect the execution of the first operation and make the result of the first operation inaccurate. In this case, the first operation may be blocked, that is, its execution is suspended, and the first operation may be executed after the second operation completes its operation on the currently targeted second fine-grained region, that is, the first operation is executed when the first fine-grained region and the second fine-grained region no longer overlap.
In this embodiment, blocking the first operation when the first fine-grained region overlaps the second fine-grained region can avoid operation errors and inaccurate results caused by overlapping fine-grained regions of the operations, ensuring the correctness of each operation.
In some embodiments, at least one of the first operation and the second operation may be a write operation. That is, when the operations on the target data are read-after-write (the second operation is a write and the first operation is a read), write-after-read (the second operation is a read and the first operation is a write), or write-after-write (both the second operation and the first operation are writes), there is a dependency between the two operations, and the method in the embodiments of the present disclosure can be used. In these embodiments, by dividing the shape coordinate space of the target data into one or more fine-grained regions and performing operations in units of fine-grained regions, operations such as write-after-read, read-after-write, and write-after-write can be executed correctly and produce accurate results, while the waiting time between operations is reduced and the execution efficiency of the processor is improved.
In the embodiments of the present disclosure, based on this fine-grained partitioning of the shape coordinate space of tensor data, a processing method is also provided that determines the execution range of an operation on the basis of a coordinate space range expressed in terms of fine-grained regions.
FIG. 3B schematically shows an exemplary flowchart of a processing method according to an embodiment of the present disclosure. Likewise, the processing method of FIG. 3B may be implemented, for example, by the processing apparatus 200 of FIG. 2.
As shown in FIG. 3B, in step S311, a first coordinate space range of the tensor data that the first operation is allowed to use is determined. This step may be performed, for example, by the control module 210 of FIG. 2. The first coordinate space range may be, for example, a part of the shape coordinate space of the tensor data involved in the first operation.
Next, in step S312, a second coordinate space range of the tensor data that will be used when the first operation is executed is determined. This step may be performed, for example, by the execution unit 220 of FIG. 2. The second coordinate space range may be, for example, a part of the shape coordinate space of the tensor data involved in the first operation.
Finally, in step S313, the first operation is executed within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. This step may be performed, for example, by the execution unit 220 of FIG. 2.
The first coordinate space range, the second coordinate space range, and the third coordinate space range may all be expressed based on fine-grained regions in the shape coordinate space of the tensor data; that is, the first, second, and third coordinate space ranges may be characterized in units of fine-grained regions.
In the embodiments of the present disclosure, by restricting the coordinate space range of the tensor data that can be used when an operation is executed, for example by restricting the operation above to execute within the third coordinate space range, it can be ensured that, when instructions are executed in parallel, the accesses of each instruction within each coordinate space range are sequential, thereby ensuring the accuracy and efficiency of processing. Further, since programming on the software side usually refers to data points or data blocks in tensor data by their spatial coordinates, constraining the parallel execution of operations through coordinate space ranges of the tensor data can simplify code programming on the software side and is more conducive to the execution of instructions.
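A minimal sketch of intersecting two coordinate space ranges expressed in fine-grained-region units (the half-open interval representation per dimension is an assumption for illustration):

    def intersect_ranges(allowed, needed):
        # Each range is a tuple of half-open intervals of fine-grained region
        # indices per dimension, e.g. ((0, 4), (0, 8)) for a two-dimensional space.
        third = []
        for (a_lo, a_hi), (n_lo, n_hi) in zip(allowed, needed):
            lo, hi = max(a_lo, n_lo), min(a_hi, n_hi)
            if lo >= hi:
                return None          # empty intersection: nothing may execute yet
            third.append((lo, hi))
        return tuple(third)

    # Range the operation is allowed to use vs. range the operation will use.
    print(intersect_ranges(((0, 4), (0, 8)), ((2, 6), (0, 4))))  # ((2, 4), (0, 4))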
在一些实施例中,可以仅在特定条件下才执行上面结合图3A和图3B描述的细粒度区域的重叠性判断,从而缩短判断时间,加速指令处理。In some embodiments, the overlap determination of the fine-grained regions described above in conjunction with FIG. 3A and FIG. 3B may be performed only under certain conditions, thereby shortening the determination time and speeding up instruction processing.
图3C示意性示出了根据本披露另一实施例的处理方法的示例性流程图。FIG. 3C schematically shows an exemplary flowchart of a processing method according to another embodiment of the present disclosure.
如图3C所示,在步骤S321中,获取指令的第一操作。在一些实施例中,第一操作为针对张量数据的操作,其操作数中可以包括张量数据的描述符。As shown in FIG. 3C, in step S321, the first operation of the instruction is obtained. In some embodiments, the first operation is an operation on tensor data, and its operand may include a descriptor of the tensor data.
接着,在步骤S322中,判断是否存在正在进行的针对该张量数据的第二操作。该操作与前面结合图3A描述的步骤S302类似,此处不再重复。Next, in step S322, it is determined whether there is an ongoing second operation on the tensor data. This operation is similar to step S302 described above in conjunction with FIG. 3A , and will not be repeated here.
If it is determined that no such second operation exists, the method may jump directly to step S326, that is, execute the first operation. This means that there is no second operation that could conflict with the first operation, so the first operation can be executed immediately; if other operations are in progress, the first operation is executed in parallel with them.
如果判断存在这样的第二操作,也即可能产生冲突,则方法可以前进到步骤S323,在此进一步判断第一操作与第二操作的数据操作范围是否重叠。可以理解,由于张量数据通常维度较大,不同操作所针对的数据操作范围可能不同。当不同操作的数据操作范围相互之间不存在重叠时,第一操作可以与在先的第二操作并行执行,而不会产生冲突。If it is determined that there is such a second operation, that is, a conflict may occur, the method may proceed to step S323, where it is further determined whether the data operation ranges of the first operation and the second operation overlap. It can be understood that since tensor data usually has a large dimension, the data operation range for different operations may be different. When the data operation ranges of different operations do not overlap with each other, the first operation can be performed in parallel with the preceding second operation without conflict.
可以采用多种方式判断操作的数据操作范围是否重叠。在一些实施例中,可以基于待操作的张量数据的空间信息和/或形状信息来判断数据操作范围是否重叠。张量数据的空间信息与形状信息可以参见前面的详细描述,此处不再重复。张量数据的形状信息可以用于确定操作的访问地址,从而判断两个操作的数据操作范围之间是否存在重叠。访问地址可以是张量数据的坐标空间地址,也可以是张量数据的存储空间地址,本披露在此方面没有限制。There are various ways to determine whether the data manipulation ranges of an operation overlap. In some embodiments, whether the data operation ranges overlap may be determined based on spatial information and/or shape information of the tensor data to be operated on. For the spatial information and shape information of tensor data, please refer to the previous detailed description, which will not be repeated here. The shape information of the tensor data can be used to determine the access address of the operation, so as to determine whether there is overlap between the data operation ranges of the two operations. The access address may be a coordinate space address of tensor data or a storage space address of tensor data, which is not limited in this aspect of the present disclosure.
If it is determined in step S323 that the data operation ranges of the first operation and the second operation do not overlap, the method may jump to step S326, that is, execute the first operation. This means that even if the first operation and the second operation access the same tensor data (as determined in step S322), as long as their data operation ranges do not overlap, that is, they access mutually non-overlapping portions of the same tensor data, the first operation can be executed in parallel with the second operation.
如果在步骤S323中判断第一操作与第二操作的数据操作范围重叠,则方法可以前进到步骤S324,在此进一步判断第一操作与第二操作当前针对的细粒度区域是否重叠。具体的判断方法可以参考前面结合图3A和图3B的描述。If it is determined in step S323 that the data operation ranges of the first operation and the second operation overlap, the method may proceed to step S324, where it is further determined whether the fine-grained regions currently targeted by the first operation and the second operation overlap. For a specific determination method, reference may be made to the foregoing description in conjunction with FIG. 3A and FIG. 3B .
When it is determined in step S324 that the fine-grained regions currently targeted by the first operation and the second operation do not overlap, the method may proceed to step S326, that is, execute the first operation. In this way, whether the currently targeted fine-grained regions overlap can be judged dynamically as the operations execute, so that parallel execution is achieved at the level of fine-grained regions and the parallel potential of the operations is exploited to the greatest extent.
If it is determined in step S324 that the fine-grained regions currently targeted by the two operations overlap, the first operation cannot be executed at this time, as doing so would cause a conflict. Therefore, in step S325, the first operation is blocked.
In the embodiment of FIG. 3C, a static pre-judgment is first performed based on the data operation ranges of the operations, and the dynamic overlap judgment on fine-grained regions is performed only under a specific condition (namely, when the data operation ranges overlap), which effectively shortens the judgment time and speeds up instruction processing.
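The overall decision flow of FIG. 3C can be summarized by the following sketch; the data model (an Op record with a tensor identifier, a data operation range, and a currently targeted fine-grained region) is an assumption introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Op:
    tensor: str          # identifier of the tensor data the operation targets
    data_range: set      # fine-grained regions the whole operation will touch
    current_region: int  # fine-grained region currently being accessed

def handle_first_operation(first_op, in_flight):
    """Decide whether the first operation may proceed (sketch of FIG. 3C)."""
    for second_op in in_flight:
        if second_op.tensor != first_op.tensor:               # S322: different tensor data
            continue
        if not (first_op.data_range & second_op.data_range):  # S323: ranges disjoint
            continue
        if first_op.current_region == second_op.current_region:
            return "block"                                    # S324 -> S325
    return "execute"                                          # S326

first = Op("T0", data_range={3, 4, 5}, current_region=3)
prior = Op("T0", data_range={4, 5, 6}, current_region=6)
print(handle_first_operation(first, [prior]))                 # execute
```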
本披露还提供了用于实施图3A、图3B和图3C的处理方法的示例性处理装置。图3D示出根据本披露实施例的处理装置的示意性功能框图。The present disclosure also provides exemplary processing apparatuses for implementing the processing methods of FIGS. 3A, 3B, and 3C. FIG. 3D shows a schematic functional block diagram of a processing apparatus according to an embodiment of the present disclosure.
如图3D所示,处理装置30包括操作获取单元31、第一确定单元32、第二确定单元33和执行单元34。As shown in FIG. 3D , the processing apparatus 30 includes an operation acquisition unit 31 , a first determination unit 32 , a second determination unit 33 and an execution unit 34 .
操作获取单元31配置用于获取指令的第一操作。该第一操作为针对张量数据的操作,该张量数据的形状坐标空间包括至少一个细粒度区域,每个细粒度区域包括形状坐标空间的一个或多个相邻坐标点。第一确定单元32配置用于确定是否存在正在进行的针对该张量数据的第二操作。第二确定单元33配置用于在存在这样的第二操作时,确定第一操作当前所针对的第一细粒度区域与第二操作当前所针对的第二细粒度区域是否存在重叠。执行单元34配置用于在第一细粒度区域与第二细粒度区域不重叠时,执行第一操作。The operation acquisition unit 31 is configured to acquire the first operation of the instruction. The first operation is an operation on tensor data. The shape coordinate space of the tensor data includes at least one fine-grained region, and each fine-grained region includes one or more adjacent coordinate points in the shape coordinate space. The first determination unit 32 is configured to determine whether there is an ongoing second operation on the tensor data. The second determining unit 33 is configured to, when there is such a second operation, determine whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation. The executing unit 34 is configured to execute the first operation when the first fine-grained region does not overlap with the second fine-grained region.
在一些实施例中,第二确定单元33可以包括第一确定子单元331和第二确定子单元332。第一确定子单元331配置用于确定允许第一操作使用的该张量数据的第一坐标空间范围。第二确定子单元332配置用于确定执行第一操作时将使用的该张量数据的第二坐标空间范围。在这些实施例中,执行单元34可以配置用于在第一坐标空间范围与第二坐标空间范围的交集所确定的第三坐标空间范围内,执行第一操作。在这些实施例中,第一坐标空间范围、第二坐标空间范围和第三坐标空间范围使用该张量数据的形状坐标空间中的细粒度区域来表征。In some embodiments, the second determination unit 33 may include a first determination subunit 331 and a second determination subunit 332 . The first determination subunit 331 is configured to determine a first coordinate space range of the tensor data that is allowed to be used by the first operation. The second determination subunit 332 is configured to determine the second coordinate space range of the tensor data to be used when the first operation is performed. In these embodiments, the execution unit 34 may be configured to execute the first operation within the third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. In these embodiments, the first coordinate space extent, the second coordinate space extent, and the third coordinate space extent are characterized using fine-grained regions in the shape coordinate space of the tensor data.
在一些实施例中,处理装置30还可以包括阻塞单元35和第三确定单元36。阻塞单元35可以配置用于在确定第一操作与第二操作当前所针对的细粒度区域重叠时,阻塞第一操作,以防止发生冲突。In some embodiments, the processing device 30 may further include a blocking unit 35 and a third determining unit 36 . The blocking unit 35 may be configured to block the first operation to prevent a conflict from occurring when it is determined that the first operation overlaps with the fine-grained region currently targeted by the second operation.
在一些实施例中,第三确定单元36可以配置用于进行预先的静态判断,也即,确定第一操作与第二操作的数据操作范围是否重叠。仅在数据操作范围重叠时,才进行第二确定单元33的判断。执行单元34可以根据各个确定单元的判断结果来执行第一操作。In some embodiments, the third determination unit 36 may be configured to perform a static judgment in advance, that is, to determine whether the data operation ranges of the first operation and the second operation overlap. The judgment of the second determination unit 33 is performed only when the data operation ranges overlap. The execution unit 34 may execute the first operation according to the judgment result of each determination unit.
Those skilled in the art can understand that the units shown in FIG. 3D are divided according to functional implementation. This division is merely exemplary; in an actual implementation, two or more functions may be implemented in the same hardware unit, and a single function may also be distributed across two hardware units. For example, in one implementation, the operation acquisition unit 31, the first determination unit 32 and the optional third determination unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2, while the second determination unit 33 and the execution unit 34 may be included in the operation module 220 of the processing apparatus 200. For another example, in another implementation, the operation acquisition unit 31, the first determination unit 32, the second determination unit 33 and the optional third determination unit 36 may be included in the control module 210 of the processing apparatus 200 shown in FIG. 2, while the execution unit 34 is included in the operation module 220 of the processing apparatus 200.
还应当理解,处理装置30中包含的诸单元与参考图3A、图3B和图3C描述的方法中的各个步骤相对应。由此,上文针对方法描述的操作和特征同样适用于处理装置30及其中包含的单元,在此不再赘述。It should also be understood that the units included in the processing device 30 correspond to the various steps in the method described with reference to Figures 3A, 3B and 3C. Therefore, the operations and features described above with respect to the method are also applicable to the processing device 30 and the units included therein, and details are not repeated here.
图4示意性示出根据本披露实施例的坐标空间范围的划分。图4以二维数据为例进行示例性图示,然而本领域技术人员可以理解,同样的方案可以类似地应用于三维甚至更多维的张量数据上。FIG. 4 schematically illustrates the division of coordinate space ranges according to an embodiment of the present disclosure. FIG. 4 uses two-dimensional data as an example for illustrative illustration, but those skilled in the art can understand that the same solution can be similarly applied to three-dimensional or even more dimensional tensor data.
As shown in FIG. 4, the shape coordinate space 400 of the two-dimensional tensor data is divided into 12 fine-grained regions, namely 4001, 4002, ..., 4011 and 4012. Within each fine-grained region, accesses are guaranteed to be sequential. Any data element (for example, a data point) of the tensor data can be represented by a two-dimensional spatial coordinate (x, y), where the X axis points horizontally to the right and the Y axis points vertically downward. Obviously, the coordinates of any data element of the tensor data will not exceed the maximum extent of the shape coordinate space.
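As a concrete, hypothetical illustration of the numbering in FIG. 4, the sketch below maps a data element's (x, y) coordinate to the fine-grained region that contains it; the 3-row-by-4-column layout and the 8x8 region size are assumptions made for illustration, not values fixed by the disclosure.

```python
# Hypothetical mapping from a 2-D coordinate to its fine-grained region.
REGION_W, REGION_H = 8, 8        # assumed extent of one fine-grained region
REGIONS_PER_ROW = 4              # 12 regions assumed laid out as 3 rows x 4 columns

def region_of(x: int, y: int) -> int:
    col = x // REGION_W
    row = y // REGION_H
    return 4001 + row * REGIONS_PER_ROW + col   # yields 4001 .. 4012

print(region_of(0, 0))    # 4001 (upper-left region)
print(region_of(31, 23))  # 4012 (lower-right region)
```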
在一些实施例中,可以将张量数据的形状坐标空间中、与第一操作关联的在先操作当前未使用的所有细粒度区域,确定为第一坐标空间范围。In some embodiments, all fine-grained regions in the shape coordinate space of the tensor data that are not currently used by prior operations associated with the first operation may be determined as the first coordinate space range.
In these embodiments, for example, when the prior operation is using fine-grained regions 4004, 4008 and 4009-4012, the spatial range that the first operation is allowed to use at this time (that is, the first coordinate space range) may include fine-grained regions 4001-4003 and fine-grained regions 4005-4007, as shown by the diagonally shaded areas.
可选地或附加地,在一些实施例中,将基于第一操作将访问的张量数据的坐标而确定的细粒度区域,确定为第二坐标空间范围。Alternatively or additionally, in some embodiments, a fine-grained region determined based on the coordinates of the tensor data to be accessed by the first operation is determined as the second coordinate space range.
In these embodiments, for example, when the first operation is expected to use the fine-grained regions other than fine-grained regions 4001 and 4002 (for example, estimated from the coordinates of the tensor data to be accessed), the spatial range that will be used when executing the first operation (that is, the second coordinate space range) may be determined as fine-grained regions 4003-4012, as shown by the dot-filled areas.
继而,根据本披露实施例,第一操作实际执行时可以操作的范围,也即第三坐标空间范围,为第一坐标空间范围与第二坐标空间范围的交集。如图4所示,在当前示例中,第三坐标空间范围为既存在斜线阴影,又存在点填充的区域,也即图4中的细粒度区域4003和4005-4007。Then, according to the embodiment of the present disclosure, the operable range when the first operation is actually performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. As shown in FIG. 4 , in the current example, the third coordinate space range is an area where both oblique line shadows and dot filling exist, that is, the fine-grained areas 4003 and 4005-4007 in FIG. 4 .
In some embodiments, the first, second and third coordinate space ranges may be characterized directly by the identifiers of the fine-grained regions they each include. For example, in the example shown in FIG. 4, the first coordinate space range may be characterized by the identifiers of fine-grained regions 4001-4003 and 4005-4007; the second coordinate space range may be characterized by the identifiers of fine-grained regions 4003-4012; and the third coordinate space range may be characterized by the identifiers of fine-grained regions 4003 and 4005-4007.
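Expressed in this identifier-based form, the FIG. 4 example can be written as the following sketch; the set encoding is one possible software-visible representation, not a prescribed format.

```python
# FIG. 4 example expressed with region identifiers (illustrative encoding only).
first_range = {4001, 4002, 4003, 4005, 4006, 4007}   # allowed for the first operation
second_range = set(range(4003, 4013))                # expected to be used by it
third_range = first_range & second_range             # range actually operated on
print(sorted(third_range))                           # [4003, 4005, 4006, 4007]
```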
Considering that, in most cases, tensor data is accessed along a certain dimension, with the access coordinate increasing gradually so that the data element at each coordinate point of the tensor data is traversed from front to back.
Therefore, in other embodiments, the first coordinate space range is characterized by the coordinate upper bounds, in one or more dimensions of the tensor data, of the fine-grained regions that the first operation is allowed to use; and/or the second coordinate space range is characterized by the coordinate lower bounds, in one or more dimensions of the tensor data, of the fine-grained regions that the first operation is expected to use. By exploiting this dimension-ordered access pattern of tensor data, the first or second coordinate space range can be characterized using only a coordinate upper bound or a coordinate lower bound, which simplifies the control information and the corresponding control method.
Still taking FIG. 4 as an example, as described above, the first coordinate space range may be characterized by the fine-grained regions lying at the coordinate upper bounds, in one or more dimensions of the tensor data, of the regions the first operation is allowed to use. For example, when the prior operation is using the rightmost column and the bottom row, six fine-grained regions in total, the spatial range that the first operation is allowed to use at this time (that is, the first coordinate space range) may include the upper-left two rows and three columns, six fine-grained regions in total, as shown by the diagonally shaded areas. In this case, in FIG. 4, the first coordinate space range can be characterized by the fine-grained regions at the X upper bound 411 on the X axis and the Y upper bound 421 on the Y axis. In this example, the first coordinate space range can be characterized by fine-grained regions 4003 and 4005, which indicates that the data coordinates accessed by the first operation cannot exceed fine-grained region 4003 in the X dimension and cannot exceed fine-grained region 4005 in the Y dimension.
Similarly, the second coordinate space range may be characterized by the fine-grained regions lying at the coordinate lower bounds, in one or more dimensions of the tensor data, of the regions the first operation is expected to use. For example, when it is determined, based on the coordinates of the tensor data to be accessed by the first operation, that the first operation will use all fine-grained regions except the two leftmost regions of the first row, the second coordinate space range can be determined to include the remaining ten fine-grained regions, as shown by the dot-filled areas. In this case, in FIG. 4, the second coordinate space range can be characterized by the fine-grained regions at the X lower bound 412 on the X axis and the Y lower bound 422 on the Y axis. In this example, the second coordinate space range can be characterized by fine-grained regions 4002 and 4001, which indicates that, when the first operation is executed, it will not access data of the tensor whose X-dimension coordinate is below fine-grained region 4002 and whose Y-dimension coordinate is below fine-grained region 4001.
第一操作实际执行时可以操作的范围为第三坐标空间范围,其为第一坐标空间范围与第二坐标空间范围的交集。如图4所示,在当前示例中,第三坐标空间范围为既存在斜线阴影又存在点填充的区域,也即图4中的“反L型”区域。The operable range when the first operation is actually performed is the third coordinate space range, which is the intersection of the first coordinate space range and the second coordinate space range. As shown in FIG. 4 , in the current example, the third coordinate space range is an area where both oblique line shadows and dot filling exist, that is, the “inverse L-shaped” area in FIG. 4 .
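A possible, simplified encoding of the upper-bound form of the first coordinate space range is sketched below; the 0-indexed region grid matches the assumed 3x4 layout of FIG. 4, and the second coordinate space range could be summarized analogously with per-dimension lower bounds.

```python
# Illustrative upper-bound encoding of the first coordinate space range.
def allowed_regions(x_upper_col, y_upper_row):
    """Every region at or before the per-dimension upper bound,
    with regions indexed as (col, row), 0-based."""
    return {(col, row)
            for col in range(x_upper_col + 1)
            for row in range(y_upper_row + 1)}

# FIG. 4 bound-form example: X upper bound at region 4003 (column index 2) and
# Y upper bound at region 4005 (row index 1) -> the upper-left 2x3 block of regions.
print(sorted(allowed_regions(2, 1)))   # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```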
可以采取多种方式来确定第一坐标空间范围和第二坐标空间范围。The first coordinate space extent and the second coordinate space extent can be determined in various ways.
In some embodiments, the first coordinate space range may be determined by additionally considering at least one of the following: the order of the operations; the operands involved in the operations; and the second coordinate space range of a prior operation. For example, in embodiments where coordinate space upper and lower bounds are used to characterize coordinate space ranges, the coordinate space lower bound of the tensor data used by a preceding operation or instruction may serve as the coordinate space upper bound of the tensor data used by the current new instruction.
In one example, when the first operation (that is, the current operation) is a read operation, the coordinate space upper bound is the coordinate space lower bound of the most recent (that is, the prior) write operation on the tensor data.
In another example, when the first operation is a write operation, the coordinate space upper bound is the minimum of the coordinate space lower bound of the most recent write operation on the tensor data and the coordinate space lower bounds of all read operations on the tensor data issued between these two write operations. By taking the minimum, it is ensured that executing the first operation will not affect the execution of any preceding operation.
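The two bound-propagation rules above can be sketched as follows; the integer region indices and parameter names are assumptions used only to make the rules concrete.

```python
# Sketch of deriving the coordinate-space upper bound of the current operation
# from the lower bounds already published by prior operations on the tensor data.

def upper_bound_for(first_is_write, last_write_lower, read_lowers_since_last_write):
    if not first_is_write:
        # Read: bounded only by the most recent write on the tensor data.
        return last_write_lower
    # Write: must not overtake the last write nor any read issued after it.
    return min([last_write_lower] + read_lowers_since_last_write)

print(upper_bound_for(False, last_write_lower=5, read_lowers_since_last_write=[3, 4]))  # 5
print(upper_bound_for(True,  last_write_lower=5, read_lowers_since_last_write=[3, 4]))  # 3
```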
Alternatively or additionally, the second coordinate space range may be determined based on at least one of the following: the execution range of the operation; the access mode of the operation; and the current execution state of the operation. For example, in embodiments where coordinate space upper and lower bounds are used to characterize coordinate space ranges, the above factors may be considered together to determine the second coordinate space range, ensuring that when the tensor data is accessed along a dimension, the coordinate in that dimension is not less than the coordinate space lower bound. Further, the lower bound should be kept as large as possible, so that a larger accessible space range is left for subsequent operations or instructions.
In one example, when the access mode of the first operation is sequential and contiguous access, the coordinate space lower bound may be determined based on the minimum access coordinate of the first operation. For example, the coordinate space lower bound may be determined as the fine-grained region in which the minimum access coordinate lies. As shown in FIG. 4, when the first operation accesses data along the X dimension, assuming the minimum X coordinate of the accessed data is A, which lies in the second fine-grained region from the left, the X lower bound can be determined as that second fine-grained region; when the first operation accesses data along the Y dimension, assuming the minimum Y coordinate of the accessed data is B, which falls in the third fine-grained region from the top, the Y lower bound can be determined as that third fine-grained region.
In another example, when the access mode of the first operation is a regular, patterned access, the coordinate space lower bound may be determined based on that pattern. For example, in a convolution operation, the tensor data may need to be accessed block by block, so the coordinate space lower bound can be determined according to the tiling pattern of the convolution operation.
在又一示例中,当无法确定第一操作的访问模式时,可以基于预定设置来确定坐标空间下界。例如,坐标空间下界可以是默认值,例如0或1个或多个细粒度区域的大小。In yet another example, when the access mode of the first operation cannot be determined, the coordinate space lower bound may be determined based on a predetermined setting. For example, the coordinate space lower bound may be a default value such as 0 or the size of 1 or more fine-grained regions.
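The three cases above for deriving the coordinate space lower bound might be sketched as follows; the region size and the default value of 0 are assumed for illustration.

```python
# Sketch of deriving the coordinate-space lower bound of the current operation.
REGION_SIZE = 8   # assumed fine-grained region extent along the accessed dimension

def lower_bound(access_mode, min_access_coord=None):
    """Return the coordinate-space lower bound as a region index (illustrative)."""
    if access_mode in ("sequential", "patterned") and min_access_coord is not None:
        # Sequential/contiguous or regular access: the bound is the region that
        # holds the smallest coordinate still to be accessed.
        return min_access_coord // REGION_SIZE
    # Access pattern unknown: fall back to a predetermined default (here 0).
    return 0

print(lower_bound("sequential", min_access_coord=19))  # region index 2
print(lower_bound("unknown"))                           # 0
```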
在一些实施例中,第一和第二坐标空间范围可以基于对张量数据的形状坐标空间的预先划分来确定。具体地,可以首先将张量数据的形状坐标空间划分为若干个空间块,例如在各个维度上均匀或不均匀划分,每个空间块包括一个或多个细粒度区域。例如,仍然参考图4,张量数据的形状坐标空间被划分为6个空间块,每个空间块A、B、C、D、E和F包括2个细粒度区域。In some embodiments, the first and second coordinate space extents may be determined based on a pre-division of the shape coordinate space of the tensor data. Specifically, the shape coordinate space of the tensor data may be firstly divided into several spatial blocks, for example, divided uniformly or non-uniformly in each dimension, and each spatial block includes one or more fine-grained regions. For example, still referring to FIG. 4, the shape coordinate space of tensor data is divided into 6 spatial blocks, each spatial block A, B, C, D, E and F including 2 fine-grained regions.
In these embodiments, the space blocks in the shape coordinate space of the tensor data for which the prior operation has been completed may be determined as the first coordinate space range; and the space blocks determined based on the coordinates of the tensor data to be accessed by the first operation may be determined as the second coordinate space range.
For example, when the prior operation has completed its access to space blocks A and B and is currently using space block C, the spatial range that the first operation is allowed to use at this time (that is, the first coordinate space range) may include space block A and space block B. Further, for example, when the first operation is expected to use space block A and space block B (for example, estimated from the coordinates of the tensor data to be accessed), the spatial range that will be used when executing the first operation (that is, the second coordinate space range) may be determined as space block A and space block B.
继而,根据本披露实施例,第一操作实际执行时可以操作的范围,也即第三坐标空间范围,为第一坐标空间范围与第二坐标空间范围的交集。在当前示例中,第三坐标空间范围为空间块A和B。Then, according to the embodiment of the present disclosure, the operable range when the first operation is actually performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. In the current example, the third coordinate space extents are space blocks A and B.
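Under stated assumptions about how block completion is tracked on the software side (the status dictionary below is not part of the disclosure), the space-block example above reduces to a simple set intersection:

```python
# Illustrative tracking of the space-block example (block labels follow FIG. 4).
block_status = {"A": "done", "B": "done", "C": "in_use",
                "D": "pending", "E": "pending", "F": "pending"}

first_range = {b for b, s in block_status.items() if s == "done"}  # prior op finished
second_range = {"A", "B"}          # blocks the first operation is expected to access
third_range = first_range & second_range
print(sorted(third_range))         # ['A', 'B']
```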
可选地或附加地,在一些实施例中,在第三坐标空间范围内,可以基于以下至少一个顺序来执行第一操作:预定的空间块顺序;和/或预定的细粒度区域顺序。Alternatively or additionally, in some embodiments, within the third coordinate space, the first operation may be performed based on at least one order of: a predetermined spatial block order; and/or a predetermined fine-grained region order.
In some implementations, after the shape coordinate space of the tensor data to be operated on has been divided into blocks in advance, for example the six space blocks of FIG. 4, a space block order, that is, the order in which the space blocks of the coordinate space are operated on, may be predetermined, for example the order of space blocks A, B, C, D, E and F. In this case, if the operation objects or the used spaces of two instructions with a dependency relationship are both the entire tensor data, the instructions can be made to operate on the space blocks one by one in that order. For example, assuming that the earlier instruction 1 writes the tensor data and the later instruction 2 reads the tensor data, instruction 1 may first write space block A and then write space block B. At this point, instruction 2 may start reading space block A. If the division of the space blocks makes the execution rhythm of instruction 2 consistent with that of instruction 1, then at a later time, when instruction 1 starts writing space block C, instruction 2 has already finished reading space block A and starts reading space block B, and so on. It can thus be seen that dividing the space into blocks facilitates the parallel execution of instructions, and agreeing on the space block order helps simplify operation scheduling, shorten processing time and improve processing efficiency.
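The write/read pipelining described above can be simulated with the short sketch below; the lock-step timing of one block per step is an assumption corresponding to the case where the two instructions have the same execution rhythm.

```python
# Illustrative pipelining of two dependent instructions over an agreed block order:
# instruction 1 writes block by block, and instruction 2 reads a block one step later.
BLOCK_ORDER = ["A", "B", "C", "D", "E", "F"]

def pipeline(block_order):
    schedule = []
    for step, block in enumerate(block_order):
        ops = [f"instr1 writes {block}"]
        if step >= 1:
            ops.append(f"instr2 reads {block_order[step - 1]}")
        schedule.append(ops)
    schedule.append([f"instr2 reads {block_order[-1]}"])   # drain the last block
    return schedule

for step, ops in enumerate(pipeline(BLOCK_ORDER)):
    print(f"step {step}: " + ", ".join(ops))
```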
Alternatively or additionally, in some implementations, when the first operation is executed within a single space block, it may also be executed in a predetermined fine-grained region order. When the operation range of instructions executed in parallel is further controlled within a single space block based on the fine-grained region currently being operated on, executing in a predetermined fine-grained region order helps simplify operation scheduling; the principle is similar to that of executing in the predetermined space block order described above and is not repeated here.
In yet other embodiments, the first and second coordinate space ranges may also be determined by combining dynamic determination with a pre-division of the shape coordinate space of the tensor data. Specifically, the shape coordinate space of the tensor data may first be divided into several space blocks, for example uniformly or non-uniformly in each dimension. Then, within each space block, the first and second coordinate space ranges may be determined dynamically based on the execution of the operations. For the specific determination manner, reference may be made to the foregoing description, which is not repeated here. In these implementations, when the exact position of the second coordinate space range within a certain space block cannot be determined, it may default to the range corresponding to that space block.
In some embodiments, the pre-division of the shape coordinate space of the tensor data may be performed based on at least one of the following: the processing capability of the hardware; preset parameters; and the size of the shape coordinate space of the tensor data. The processing capability of the hardware may include, but is not limited to, the data bit width that the hardware can process. Partitioning the shape coordinate space of the tensor data based on the data bit width that the hardware can process makes full use of the processing capability of the hardware and improves parallel processing efficiency. The preset parameters may, for example, directly specify the number of space blocks to be divided, the size of each dimension of a space block, and so on. The shape coordinate space of the tensor data may also be partitioned based on the size of the shape coordinate space itself. For example, when the tensor data is a two-dimensional matrix of M rows by N columns (M and N are positive integers), each row may be divided evenly into m parts and each column evenly into n parts, giving a total of m*n space blocks.
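A minimal sketch of this pre-division for a two-dimensional M-row by N-column shape coordinate space follows; even divisibility is assumed here for simplicity, although the text also allows uneven splits.

```python
# Illustrative even pre-division of an M x N shape coordinate space into m x n space blocks.
def block_bounds(M, N, m, n):
    """Return the (row_range, col_range) of every space block, assuming m | M and n | N."""
    bh, bw = M // m, N // n
    return [((i * bh, (i + 1) * bh), (j * bw, (j + 1) * bw))
            for i in range(m) for j in range(n)]

for rows, cols in block_bounds(M=6, N=9, m=2, n=3):
    print(f"rows {rows}, cols {cols}")      # six blocks of 3 x 3 coordinate points
```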
Although FIG. 4 shows six evenly divided space blocks, the space may also be divided into various numbers of space blocks of unequal sizes; the present disclosure places no limit on the specific division manner. The above describes a scheme that, when the hardware executes operations in parallel, constrains the spatial range actually used by an operation so as to ensure sequential consistency of data processing while improving parallel processing efficiency. Those skilled in the art can understand that the current operation (for example, the aforementioned first operation) and the prior operation (or preceding operation) may be operations in different instructions executed in parallel, or may be different operations executed in parallel within the same instruction; the present disclosure is not limited in this respect.
The processing methods performed by the processing apparatus of the embodiments of the present disclosure have been described above with reference to the flowcharts. Those skilled in the art can understand that, because the operations executed in parallel are constrained based on the coordinate space range of the processed data, the degree of parallelism of the operations can be increased while the sequential consistency of their execution is guaranteed, thereby improving processing efficiency. It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
It should be further noted that although the steps in the method flowcharts are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the method flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; the execution order of these sub-steps or stages is not necessarily sequential either, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
图5是示出根据本披露实施例的一种组合处理装置500的结构图。如图5中所示,该组合处理装置500包括计算处理装置502、接口装置504、其他处理装置506和存储装置508。根据不同的应用场景,计算处理装置中可以包括一个或多个计算装置510,该计算装置可以配置成图2所示的处理装置200,用于执行本文结合附图4所描述的操作。FIG. 5 is a structural diagram illustrating a combined processing apparatus 500 according to an embodiment of the present disclosure. As shown in FIG. 5 , the combined processing device 500 includes a computing processing device 502 , an interface device 504 , other processing devices 506 and a storage device 508 . According to different application scenarios, the computing and processing apparatus may include one or more computing apparatuses 510, and the computing apparatus may be configured as the processing apparatus 200 shown in FIG. 2 for performing the operations described herein in conjunction with FIG. 4 .
在不同的实施例中,本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中,该计算处理装置可以实现为单核人工智能处理器或者多核人工智能处理器。类似地,包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时,就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。In various embodiments, the computing processing devices of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as an artificial intelligence processor core or a part of the hardware structure of an artificial intelligence processor core, for the computing processing device of the present disclosure, it can be regarded as having a single-core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete an operation specified by a user. Depending on the implementation, the other processing apparatuses of the present disclosure may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU) and an artificial intelligence processor. These processors may include, but are not limited to, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on, and their number may be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing apparatus and the other processing apparatuses are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing apparatus may serve as an interface between the computing processing apparatus of the present disclosure (which may be embodied as a computing apparatus related to artificial intelligence operations such as neural network operations) and external data and control, performing basic control including, but not limited to, data movement and starting and/or stopping the computing apparatus. In other embodiments, the other processing apparatus may also cooperate with the computing processing apparatus to jointly complete computing tasks.
在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device. Further, the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip. Alternatively or alternatively, the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。Additionally or alternatively, the combined processing device of the present disclosure may also include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively. In one or more embodiments, a storage device may be used to store data of the computing processing device and/or the other processing device. For example, the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
在一些实施例里,本披露还公开了一种芯片(例如图6中示出的芯片602)。在一种实现中,该芯片是一种系统级芯片(System on Chip,SoC),并且集成有一个或多个如图5中所示的组合处理装置。该芯片可以通过对外接口装置(如图6中示出的对外接口装置606)与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或wifi接口。在一些应用场景中,该芯片上可以集成有其他处理单元(例如视频编解码器)和/或接口模块(例如DRAM接口)等。在一些实施例中,本披露还公开了一种芯片封装结构,其包括了上述芯片。在一些实施例里,本披露还公开了一种板卡,其包括上述的芯片封装结构。下面将结合图6对该板卡进行详细地描述。In some embodiments, the present disclosure also discloses a chip (eg, chip 602 shown in FIG. 6). In one implementation, the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 5 . The chip can be connected with other related components through an external interface device (such as the external interface device 606 shown in FIG. 6 ). The relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. In some application scenarios, other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above-mentioned chip. In some embodiments, the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 6 .
图6是示出根据本披露实施例的一种板卡600的结构示意图。如图6中所示,该板卡包括用于存储数据的存储器件604,其包括一个或多个存储单元610。该存储器件可以通过例如总线等方式与控制器件608和上文所述的芯片602进行连接和数据传输。进一步,该板卡还包括对外接口装置606,其配置用于芯片(或芯片封装结构中的芯片)与外部设备612(例如服务器或计算机等)之间的数据中继或转接功能。例如,待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如,所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景,所述对外接口装置可以具有不同的接口形式,例如其可以采用标准PCIE接口等。FIG. 6 is a schematic structural diagram illustrating a board 600 according to an embodiment of the present disclosure. As shown in FIG. 6 , the board includes a storage device 604 for storing data, which includes one or more storage units 610 . The storage device can be connected to the control device 608 and the chip 602 described above for connection and data transmission through, for example, a bus. Further, the board also includes an external interface device 606, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 612 (such as a server or a computer, etc.). For example, the data to be processed can be transmitted to the chip by an external device through an external interface device. For another example, the calculation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
在一个或多个实施例中,本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此,在一个应用场景中,该控制器件可以包括单片机(Micro Controller Unit,MCU),以用于对所述芯片的工作状态进行调控。In one or more embodiments, the control device in the board of the present disclosure may be configured to regulate the state of the chip. To this end, in an application scenario, the control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
根据上述结合图5和图6的描述,本领域技术人员可以理解本披露也公开了一种电子设备或装置,其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。According to the above description in conjunction with FIG. 5 and FIG. 6 , those skilled in the art can understand that the present disclosure also discloses an electronic device or device, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从 云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。According to different application scenarios, the electronic devices or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile Terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment. The vehicles include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph. The electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal. In one or more embodiments, the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that the hardware resources of the cloud device can be retrieved from the hardware information of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device. Match the appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。It should be noted that, for the purpose of simplicity, the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。In terms of specific implementation, based on the disclosure and teaching of this disclosure, those skilled in the art can understand that several embodiments disclosed in this disclosure can also be implemented in other ways not disclosed herein. For example, as for each unit in the foregoing electronic device or apparatus embodiment, this article divides them on the basis of considering logical functions, and there may also be other division methods in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As far as the connection relationship between different units or components is concerned, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。In this disclosure, units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or elements may be co-located or distributed over multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a mobile hard disk, a magnetic disk, or a CD, etc. that can store programs. medium of code.
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如CPU、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。In other implementation scenarios, the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In this regard, the various types of devices described herein (eg, computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (High Bandwidth Memory) , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.
依据以下条款可更好地理解前述内容:The foregoing can be better understood in accordance with the following terms:
条款1.一种处理方法,所述方法包括: Clause 1. A method of processing, the method comprising:
obtaining a first operation of an instruction, where the first operation is an operation on tensor data, the shape coordinate space of the tensor data includes at least one fine-grained region, and each fine-grained region includes one or more adjacent coordinate points of the shape coordinate space;
确定是否存在正在进行的针对所述张量数据的第二操作;determining whether there is an ongoing second operation on the tensor data;
在存在所述第二操作时,确定所述第一操作当前所针对的第一细粒度区域与所述第二操作当前所针对的第二细粒度区域是否存在重叠;以及When the second operation exists, determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
在所述第一细粒度区域与所述第二细粒度区域不重叠时,执行所述第一操作。The first operation is performed when the first fine-grained region does not overlap with the second fine-grained region.
条款2.根据条款1所述的方法,其中所述方法还包括:Clause 2. The method of clause 1, wherein the method further comprises:
在所述第一细粒度区域与所述第二细粒度区域存在重叠时,阻塞所述第一操作。The first operation is blocked when the first fine-grained region overlaps the second fine-grained region.
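As a purely illustrative aid to Clauses 1-2 (a minimal sketch, not the claimed method itself), the following Python fragment models a fine-grained region as an axis-aligned block of adjacent coordinate points in the tensor's shape coordinate space and applies the overlap test that decides whether the first operation proceeds or is blocked. All names (FineGrainedRegion, may_execute) and the retry-on-block behaviour are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FineGrainedRegion:
    """A hyper-rectangular block of adjacent coordinate points: start (inclusive)
    and end (exclusive) along each axis of the tensor's shape coordinate space."""
    start: Tuple[int, ...]
    end: Tuple[int, ...]

    def overlaps(self, other: "FineGrainedRegion") -> bool:
        # Two axis-aligned regions overlap iff their intervals overlap in every dimension.
        return all(s1 < e2 and s2 < e1
                   for s1, e1, s2, e2 in zip(self.start, self.end, other.start, other.end))

def may_execute(first_region: FineGrainedRegion,
                second_region: Optional[FineGrainedRegion]) -> bool:
    """Clauses 1-2: if no second operation is in progress, or the regions currently
    targeted by the two operations do not overlap, the first operation may execute;
    otherwise it is blocked."""
    if second_region is None:          # no ongoing second operation on this tensor
        return True
    return not first_region.overlaps(second_region)

# Example: a write is working on rows 0-3 while a read wants rows 4-7 of an 8x8 tensor.
writer_now = FineGrainedRegion(start=(0, 0), end=(4, 8))
reader_now = FineGrainedRegion(start=(4, 0), end=(8, 8))
assert may_execute(reader_now, writer_now)   # disjoint regions -> proceed
```

In this toy model, a blocked first operation would simply be re-checked once the second operation has moved on to a different fine-grained region.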
Clause 3. The method of any one of Clauses 1-2, wherein the method further comprises:
determining whether a data operation range of the first operation overlaps with a data operation range of the second operation;
when the data operation ranges overlap, performing said determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
when the data operation ranges do not overlap, performing the first operation.
Clause 4. The method of Clause 3, wherein whether the data operation range of the first operation overlaps with the data operation range of the second operation is determined based on at least one of the following:
spatial information of the tensor data to be operated on; and/or
shape information of the tensor data to be operated on.
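The two-stage check of Clauses 3-4 can be pictured with a small sketch: a cheap comparison of the operations' overall data operation ranges, derived here from hypothetical spatial information (a base address) and shape information (a byte extent), gates the more detailed fine-grained comparison. The interval representation and the function name address_ranges_overlap are assumptions made only for this illustration.

```python
from typing import Tuple

def address_ranges_overlap(range_a: Tuple[int, int], range_b: Tuple[int, int]) -> bool:
    """Coarse check in the spirit of Clauses 3-4: compare the overall data operation
    ranges of two operations, modelled as [start_address, end_address) intervals built
    from each tensor's spatial information (base address) and shape information (extent).
    Only if these coarse ranges overlap is the per-region check of Clause 1 performed."""
    a_start, a_end = range_a
    b_start, b_end = range_b
    return a_start < b_end and b_start < a_end

# Example: two operations touching disjoint buffers can skip the fine-grained check.
op1_range = (0x1000, 0x1800)   # hypothetical start/end addresses of operand 1
op2_range = (0x2000, 0x2400)
if not address_ranges_overlap(op1_range, op2_range):
    pass  # Clause 3: ranges do not overlap -> execute the first operation directly
```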
Clause 5. The method of any one of Clauses 1-4, wherein the method further comprises:
determining a first coordinate space range of the tensor data that the first operation is allowed to use;
determining a second coordinate space range of the tensor data that will be used when performing the first operation; and
performing the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range;
wherein the first coordinate space range, the second coordinate space range, and the third coordinate space range are characterized using the fine-grained regions.
Clause 6. The method of Clause 5, wherein the first coordinate space range is determined based on at least one of the following:
the order of the operations;
the operands involved in the operation;
the second coordinate space range of a preceding operation; and
a predetermined division of the shape coordinate space of the tensor data.
Clause 7. The method of any one of Clauses 5-6, wherein the second coordinate space range is determined based on at least one of the following:
the execution range of the operation;
the access pattern of the operation;
the current execution state of the operation; and
a predetermined division of the shape coordinate space of the tensor data.
Clause 8. The method of any one of Clauses 5-7, wherein:
determining the first coordinate space range comprises: determining a coordinate space upper bound, in one or more dimensions, of the tensor data that the first operation is allowed to use; and/or
determining the second coordinate space range comprises: determining a coordinate space lower bound, in one or more dimensions, of the tensor data that the first operation is expected to use.
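Clauses 5-8 describe executing the first operation only inside the intersection of an allowed coordinate space range and the range the operation will actually use, with the ranges expressed as per-dimension upper and lower bounds counted in fine-grained regions. The sketch below shows one way this could look along a single dimension; reading the upper bound as the progress of a preceding operation is an assumption made for the example, as are the function name and its parameters.

```python
def executable_region(allowed_upper: int, needed_lower: int, needed_upper: int,
                      granule: int) -> range:
    """Single-dimension sketch of Clauses 5-8, counting in fine-grained granules.
    `allowed_upper` expresses the first coordinate space range as an upper bound
    (e.g., how many granules a preceding operation has already finished and the first
    operation may therefore consume). `needed_lower`/`needed_upper` bound the second
    coordinate space range the first operation wants to use. The third range is their
    intersection; the first operation runs only over it."""
    third_lower = needed_lower
    third_upper = min(allowed_upper, needed_upper)
    if third_upper <= third_lower:
        return range(0)                      # nothing executable yet -> wait
    return range(third_lower * granule, third_upper * granule)

# Example: a producer has completed 6 of 16 granules; the consumer needs granules 2..10.
# Only granules 2..6 (coordinates 32 up to 96 with a granule size of 16) can run now.
print(list(executable_region(allowed_upper=6, needed_lower=2, needed_upper=10, granule=16))[:3])
```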
Clause 9. The method of any one of Clauses 1-8, wherein the size and/or number of the fine-grained regions is determined at least in part based on at least one of the following:
the computing capability of the hardware;
the bandwidth of the hardware; and
the size of the shape coordinate space of the tensor data.
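Clause 9 leaves open how the fine-grained region size is chosen. One conceivable heuristic, shown below purely as an assumption-laden sketch (the 2-FLOPs-per-byte figure and all parameter values are invented for illustration, not taken from the disclosure), balances the hardware's compute throughput against its bandwidth and clamps the result to the tensor's shape coordinate space.

```python
def choose_granule_rows(total_rows: int, peak_flops: float, bandwidth_bytes: float,
                        bytes_per_row: int) -> int:
    """Illustrative heuristic in the spirit of Clause 9: pick a fine-grained region
    roughly large enough that transferring one region keeps the compute units busy,
    then clamp it to the number of rows in the tensor's shape coordinate space."""
    compute_rows_per_s = peak_flops / (2.0 * bytes_per_row)     # assumes ~2 FLOPs per byte
    transfer_rows_per_s = bandwidth_bytes / bytes_per_row
    ratio = max(1, round(compute_rows_per_s / transfer_rows_per_s))
    return max(1, min(total_rows, ratio))

# Example: a 1024-row tensor with 4 KiB rows on hypothetical hardware figures.
print(choose_granule_rows(total_rows=1024, peak_flops=8e12,
                          bandwidth_bytes=1e11, bytes_per_row=4096))
```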
Clause 10. The method of any one of Clauses 1-9, wherein at least one of the first operation and the second operation is a write operation.
Clause 11. The method of any one of Clauses 1-10, wherein:
the first operation and the second operation are operations in different instructions executed in parallel; or
the first operation and the second operation are different operations executed in parallel within the same instruction.
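Clauses 10-11 restrict the scenario to cases where at least one of the two concurrent operations writes the tensor. A minimal sketch of the usual reasoning, assuming (for illustration only) that two pure reads never conflict, is:

```python
def needs_fine_grained_check(first_is_write: bool, second_is_write: bool) -> bool:
    """Clause 10 states that at least one of the two operations is a write. A common
    reading (assumed here) is that two concurrent reads of the same tensor carry no
    hazard, so the fine-grained overlap check only matters when at least one side
    writes (read-after-write, write-after-read, or write-after-write)."""
    return first_is_write or second_is_write

assert needs_fine_grained_check(True, False)        # write vs. read -> must check regions
assert not needs_fine_grained_check(False, False)   # read vs. read -> safe to run in parallel
```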
Clause 12. A processing apparatus, comprising:
an operation obtaining unit configured to obtain a first operation of an instruction, the first operation being an operation on tensor data, wherein a shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points of the shape coordinate space;
a first determining unit configured to determine whether there is an ongoing second operation on the tensor data;
a second determining unit configured to, when the second operation exists, determine whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and
an execution unit configured to perform the first operation when the first fine-grained region does not overlap with the second fine-grained region.
Clause 13. The processing apparatus of Clause 12, wherein the processing apparatus further comprises:
a blocking unit configured to block the first operation when the first fine-grained region overlaps with the second fine-grained region.
Clause 14. The processing apparatus of any one of Clauses 12-13, wherein the processing apparatus further comprises:
a third determining unit configured to determine whether a data operation range of the first operation overlaps with a data operation range of the second operation; and
the second determining unit is configured to, when the third determining unit determines that the data operation ranges overlap, perform said determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
the execution unit is configured to perform the first operation when the third determining unit determines that the data operation ranges do not overlap.
Clause 15. The processing apparatus of Clause 14, wherein the third determining unit determines whether the data operation range of the first operation overlaps with the data operation range of the second operation based on at least one of the following:
spatial information of the tensor data to be operated on; and/or
shape information of the tensor data to be operated on.
Clause 16. The processing apparatus of Clauses 12-15, wherein the second determining unit further comprises:
a first determining subunit configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use;
a second determining subunit configured to determine a second coordinate space range of the tensor data that will be used when performing the first operation; and
the execution unit is further configured to perform the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range,
wherein the first coordinate space range, the second coordinate space range, and the third coordinate space range are characterized using the fine-grained regions.
Clause 17. The processing apparatus of Clause 16, wherein the first coordinate space range is determined based on at least one of the following:
the order of the operations;
the operands involved in the operation;
the second coordinate space range of a preceding operation; and
a predetermined division of the shape coordinate space of the tensor data.
Clause 18. The processing apparatus of any one of Clauses 16-17, wherein the second coordinate space range is determined based on at least one of the following:
the execution range of the operation;
the access pattern of the operation;
the current execution state of the operation; and
a predetermined division of the shape coordinate space of the tensor data.
Clause 19. The processing apparatus of any one of Clauses 16-18, wherein:
the first determining subunit is further configured to: determine a coordinate space upper bound, in one or more dimensions, of the tensor data that the first operation is allowed to use; and/or
the second determining subunit is further configured to: determine a coordinate space lower bound, in one or more dimensions, of the tensor data that the first operation is expected to use.
Clause 20. The processing apparatus of any one of Clauses 12-19, wherein the size and/or number of the fine-grained regions is determined at least in part based on at least one of the following:
the computing capability of the hardware;
the bandwidth of the hardware; and
the size of the shape coordinate space of the tensor data.
Clause 21. The processing apparatus of any one of Clauses 12-20, wherein at least one of the first operation and the second operation is a write operation.
Clause 22. The processing apparatus of any one of Clauses 12-21, wherein:
the first operation and the second operation are operations in different instructions executed in parallel; or
the first operation and the second operation are different operations executed in parallel within the same instruction.
Clause 23. A chip, wherein the chip comprises the processing apparatus of any one of Clauses 12-22.
Clause 24. A board card, wherein the board card comprises the chip of Clause 23.

Claims (24)

  1. A processing method, the method comprising:
    obtaining a first operation of an instruction, the first operation being an operation on tensor data, wherein a shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points of the shape coordinate space;
    determining whether there is an ongoing second operation on the tensor data;
    when the second operation exists, determining whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and
    when the first fine-grained region does not overlap with the second fine-grained region, performing the first operation.
  2. The method of claim 1, wherein the method further comprises:
    when the first fine-grained region overlaps with the second fine-grained region, blocking the first operation.
  3. The method of any one of claims 1-2, wherein the method further comprises:
    determining whether a data operation range of the first operation overlaps with a data operation range of the second operation;
    when the data operation ranges overlap, performing said determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
    when the data operation ranges do not overlap, performing the first operation.
  4. The method of claim 3, wherein whether the data operation range of the first operation overlaps with the data operation range of the second operation is determined based on at least one of the following:
    spatial information of the tensor data to be operated on; and/or
    shape information of the tensor data to be operated on.
  5. The method of any one of claims 1-4, wherein the method further comprises:
    determining a first coordinate space range of the tensor data that the first operation is allowed to use;
    determining a second coordinate space range of the tensor data that will be used when performing the first operation; and
    performing the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range;
    wherein the first coordinate space range, the second coordinate space range, and the third coordinate space range are characterized using the fine-grained regions.
  6. The method of claim 5, wherein the first coordinate space range is determined based on at least one of the following:
    the order of the operations;
    the operands involved in the operation;
    the second coordinate space range of a preceding operation; and
    a predetermined division of the shape coordinate space of the tensor data.
  7. The method of any one of claims 5-6, wherein the second coordinate space range is determined based on at least one of the following:
    the execution range of the operation;
    the access pattern of the operation;
    the current execution state of the operation; and
    a predetermined division of the shape coordinate space of the tensor data.
  8. The method of any one of claims 5-7, wherein:
    determining the first coordinate space range comprises: determining a coordinate space upper bound, in one or more dimensions, of the tensor data that the first operation is allowed to use; and/or
    determining the second coordinate space range comprises: determining a coordinate space lower bound, in one or more dimensions, of the tensor data that the first operation is expected to use.
  9. The method of any one of claims 1-8, wherein the size and/or number of the fine-grained regions is determined at least in part based on at least one of the following:
    the computing capability of the hardware;
    the bandwidth of the hardware; and
    the size of the shape coordinate space of the tensor data.
  10. The method of any one of claims 1-9, wherein at least one of the first operation and the second operation is a write operation.
  11. The method of any one of claims 1-10, wherein:
    the first operation and the second operation are operations in different instructions executed in parallel; or
    the first operation and the second operation are different operations executed in parallel within the same instruction.
  12. A processing apparatus, comprising:
    an operation obtaining unit configured to obtain a first operation of an instruction, the first operation being an operation on tensor data, wherein a shape coordinate space of the tensor data includes at least one fine-grained region, and the fine-grained region includes one or more adjacent coordinate points of the shape coordinate space;
    a first determining unit configured to determine whether there is an ongoing second operation on the tensor data;
    a second determining unit configured to, when the second operation exists, determine whether a first fine-grained region currently targeted by the first operation overlaps with a second fine-grained region currently targeted by the second operation; and
    an execution unit configured to perform the first operation when the first fine-grained region does not overlap with the second fine-grained region.
  13. The processing apparatus of claim 12, wherein the processing apparatus further comprises:
    a blocking unit configured to block the first operation when the first fine-grained region overlaps with the second fine-grained region.
  14. The processing apparatus of any one of claims 12-13, wherein the processing apparatus further comprises:
    a third determining unit configured to determine whether a data operation range of the first operation overlaps with a data operation range of the second operation; and
    the second determining unit is configured to, when the third determining unit determines that the data operation ranges overlap, perform said determining whether the first fine-grained region currently targeted by the first operation overlaps with the second fine-grained region currently targeted by the second operation; and
    the execution unit is configured to perform the first operation when the third determining unit determines that the data operation ranges do not overlap.
  15. The processing apparatus of claim 14, wherein the third determining unit determines whether the data operation range of the first operation overlaps with the data operation range of the second operation based on at least one of the following:
    spatial information of the tensor data to be operated on; and/or
    shape information of the tensor data to be operated on.
  16. The processing apparatus of claims 12-15, wherein the second determining unit further comprises:
    a first determining subunit configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use;
    a second determining subunit configured to determine a second coordinate space range of the tensor data that will be used when performing the first operation; and
    the execution unit is further configured to perform the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range,
    wherein the first coordinate space range, the second coordinate space range, and the third coordinate space range are characterized using the fine-grained regions.
  17. The processing apparatus of claim 16, wherein the first coordinate space range is determined based on at least one of the following:
    the order of the operations;
    the operands involved in the operation;
    the second coordinate space range of a preceding operation; and
    a predetermined division of the shape coordinate space of the tensor data.
  18. The processing apparatus of any one of claims 16-17, wherein the second coordinate space range is determined based on at least one of the following:
    the execution range of the operation;
    the access pattern of the operation;
    the current execution state of the operation; and
    a predetermined division of the shape coordinate space of the tensor data.
  19. The processing apparatus of any one of claims 16-18, wherein:
    the first determining subunit is further configured to: determine a coordinate space upper bound, in one or more dimensions, of the tensor data that the first operation is allowed to use; and/or
    the second determining subunit is further configured to: determine a coordinate space lower bound, in one or more dimensions, of the tensor data that the first operation is expected to use.
  20. The processing apparatus of any one of claims 12-19, wherein the size and/or number of the fine-grained regions is determined at least in part based on at least one of the following:
    the computing capability of the hardware;
    the bandwidth of the hardware; and
    the size of the shape coordinate space of the tensor data.
  21. The processing apparatus of any one of claims 12-20, wherein at least one of the first operation and the second operation is a write operation.
  22. The processing apparatus of any one of claims 12-21, wherein:
    the first operation and the second operation are operations in different instructions executed in parallel; or
    the first operation and the second operation are different operations executed in parallel within the same instruction.
  23. A chip, characterized in that the chip comprises the processing apparatus of any one of claims 12-22.
  24. A board card, characterized in that the board card comprises the chip of claim 23.
PCT/CN2021/123552 2020-11-13 2021-10-13 Processing method, processing apparatus, and related product WO2022100345A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011270378.5A CN114489799A (en) 2020-11-13 2020-11-13 Processing method, processing device and related product
CN202011270378.5 2020-11-13

Publications (1)

Publication Number Publication Date
WO2022100345A1 (en)

Family

ID=81489888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123552 WO2022100345A1 (en) 2020-11-13 2021-10-13 Processing method, processing apparatus, and related product

Country Status (2)

Country Link
CN (1) CN114489799A (en)
WO (1) WO2022100345A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816773B (en) * 2022-06-29 2022-09-23 浙江大华技术股份有限公司 Data processing method, system, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7508397B1 (en) * 2004-11-10 2009-03-24 Nvidia Corporation Rendering of disjoint and overlapping blits
CN105260554A (en) * 2015-10-27 2016-01-20 武汉大学 GPU cluster-based multidimensional big data factorization method
CN111079917A (en) * 2018-10-22 2020-04-28 北京地平线机器人技术研发有限公司 Tensor data block access method and device
CN111401510A (en) * 2019-09-24 2020-07-10 上海寒武纪信息科技有限公司 Data processing method and device, computer equipment and storage medium
CN111857828A (en) * 2019-04-25 2020-10-30 安徽寒武纪信息科技有限公司 Processor operation method and device and related product

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7508397B1 (en) * 2004-11-10 2009-03-24 Nvidia Corporation Rendering of disjoint and overlapping blits
CN105260554A (en) * 2015-10-27 2016-01-20 武汉大学 GPU cluster-based multidimensional big data factorization method
CN111079917A (en) * 2018-10-22 2020-04-28 北京地平线机器人技术研发有限公司 Tensor data block access method and device
CN111857828A (en) * 2019-04-25 2020-10-30 安徽寒武纪信息科技有限公司 Processor operation method and device and related product
CN111401510A (en) * 2019-09-24 2020-07-10 上海寒武纪信息科技有限公司 Data processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114489799A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
US20210150325A1 (en) Data processing method and apparatus, and related product
JP7150802B2 (en) Data processing method, apparatus, and related products
CN111857669A (en) Software and hardware decoupling software radar system, real-time design method and server
WO2022100345A1 (en) Processing method, processing apparatus, and related product
US20240111536A1 (en) Data processing apparatus and related products
WO2022100286A1 (en) Data processing apparatus, data processing method, and related product
WO2021018313A1 (en) Data synchronization method and apparatus, and related product
CN114489803A (en) Processing device, processing method and related product
CN114489805A (en) Processing method, processing device and related product
WO2022062682A1 (en) Data processing device, integrated circuit chip, device, and implementation method therefor
WO2022001499A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022253287A1 (en) Method for generating random number, and related product thereof
CN114489802A (en) Data processing device, data processing method and related product
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN114489804A (en) Processing method, processing device and related product
CN114489788A (en) Instruction processing device, instruction processing method and related product
US11836491B2 (en) Data processing method and apparatus, and related product for increased efficiency of tensor processing
CN114489789A (en) Processing device, processing method and related product
JP7266121B2 (en) Computing equipment, chips, board cards, electronic devices and computing methods
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN112396170B (en) Operation method, device, computer equipment and storage medium
CN115373583A (en) Data access method and related product
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
CN114282159A (en) Data processing device, integrated circuit chip, equipment and method for realizing the same
CN117667211A (en) Instruction synchronous control method, synchronous controller, processor, chip and board card

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21890872

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21890872

Country of ref document: EP

Kind code of ref document: A1