CN114489799A - Processing method, processing device and related product

Info

Publication number
CN114489799A
CN114489799A (application CN202011270378.5A)
Authority
CN
China
Prior art keywords
data
fine
coordinate space
tensor data
grained region
Prior art date
Legal status
Pending
Application number
CN202011270378.5A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202011270378.5A priority Critical patent/CN114489799A/en
Priority to PCT/CN2021/123552 priority patent/WO2022100345A1/en
Publication of CN114489799A publication Critical patent/CN114489799A/en
Pending legal-status Critical Current

Classifications

    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3001 Arithmetic instructions
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7817 System on chip specially adapted for signal processing, e.g. Harvard architectures
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Signal Processing (AREA)
  • Image Generation (AREA)

Abstract

The disclosure relates to a processing method, a processing device, and related products. The processing device may be implemented as a computing device included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides a solution for instruction parallelism, which can improve the degree of instruction parallelism and thereby the processing efficiency of the machine.

Description

Processing method, processing device and related product
Technical Field
The disclosure relates to the field of processors, and in particular, to a processing method, a processing apparatus, a chip, and a board card.
Background
The instruction system is an interface for the interaction of computer software and hardware, and is a very important part in the structure of a computer system. With the continuous development of artificial intelligence technology, the amount of data and the data dimension which need to be processed are increasing. Therefore, how to reasonably and scientifically control the execution of instructions, especially to improve the degree of instruction parallelism and the performance of a machine is an important problem in the field of processors.
Disclosure of Invention
To address one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a scheme for enhancing instruction parallelism. With the instruction system of the present disclosure, the degree of instruction parallelism can be improved, thereby improving the processing efficiency of the machine.
In a first aspect, the present disclosure provides a method of processing, the method comprising: obtaining a first operation of an instruction, the first operation being an operation on tensor data, a shape coordinate space of the tensor data comprising at least one fine-grained region, the fine-grained region comprising one or more neighboring coordinate points of the shape coordinate space; determining whether there is an ongoing second operation for the tensor data; when the second operation exists, determining whether a first fine-grained region currently aimed at by the first operation and a second fine-grained region currently aimed at by the second operation overlap; and when the first fine-grained region and the second fine-grained region do not overlap, performing the first operation.
In a second aspect, the present disclosure provides a processing apparatus comprising: an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data whose shape coordinate space includes at least one fine-grained region including one or more adjacent coordinate points of the shape coordinate space; a first determination unit configured to determine whether there is an ongoing second operation for the tensor data; a second determining unit, configured to determine, when the second operation exists, whether there is an overlap between a first fine-grained region currently targeted by the first operation and a second fine-grained region currently targeted by the second operation; and an execution unit configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.
In a third aspect, the present disclosure provides a chip comprising the processing device of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the third aspect.
With the processing apparatus, processing method, chip, and board provided above, during execution of the operations of an instruction, embodiments of the present disclosure constrain the parallelism of operations at the granularity of fine-grained regions of the shape coordinate space of the tensor data that the operations target, so that the potential for parallel execution of operations can be exploited. Therefore, according to the embodiments of the present disclosure, hardware can preserve a consistent execution order during parallel execution while improving the degree of parallelism of operations, thereby ensuring both the accuracy and the efficiency of processing.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:
FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;
FIG. 1B shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure;
FIG. 2 shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;
FIGS. 3A-3C show schematic flow charts of a processing method according to an embodiment of the present disclosure;
FIG. 3D shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of a coordinate space range in accordance with an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a combined treatment device according to an embodiment of the disclosure; and
FIG. 6 shows a schematic structural diagram of a board card according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. as may be used in the claims, the specification, and the drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Computers process various data by executing instructions. To indicate the source of the data, the destination of the operation results, and the operation performed, an instruction typically contains the following information:
(1) The operation code (OP) is used to indicate the operation to be performed by the instruction (e.g., add, subtract, multiply, divide, data transfer, etc.) and specifies the nature and function of the operation. A computer may have tens to hundreds of instructions, each with a corresponding opcode; the computer recognizes the opcode to perform the corresponding operation.
(2) The operand is used to describe the operation object of the instruction. Operands may relate to the data type, memory access address, addressing mode, etc. of the operated-on object. An operand may directly give the operated-on object, or indicate a memory address or a register address (i.e., a register name) of the operated-on object.
The instructions of conventional processors are designed to perform basic single-data scalar operations. Here, a single-data scalar operation refers to an instruction in which each operand is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional data types (i.e., tensor data), and scalar operations alone do not allow hardware to complete such operation tasks efficiently. Therefore, how to efficiently execute the processing of multidimensional tensor data is also an urgent problem to be solved in the current computing field.
In an embodiment of the present disclosure, an instruction system is provided in which a descriptor is included in an operand of an instruction, and information related to tensor data can be acquired through the descriptor. In particular, the descriptor may indicate at least one of the following: shape information of the tensor data, and spatial information of the tensor data. The shape information can be used to determine the data address, in the data storage space, of the tensor data corresponding to the operand. The spatial information can be used to determine dependencies between instructions, which in turn can determine, for example, the order of execution of the instructions. Spatial information of tensor data can be indicated by a space identification (ID). The space ID may also be referred to as a space alias; it refers to the spatial region storing the corresponding tensor data, which may be a contiguous space or multiple segments of space. Spatial regions pointed to by different space IDs have no dependency relationship; for example, this can be ensured by making the spatial regions pointed to by different space IDs not overlap with each other.
Various possible implementations of shape information of tensor data are described in detail below in conjunction with the figures.
Tensors may contain multiple forms of data composition. The tensor can be of different dimensions, e.g. a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a matrix can be a tensor of 2 or more dimensions. The shape of the tensor includes information such as the dimensions of the tensor, the sizes of the dimensions of the tensor, and the like. For example, for a three-dimensional tensor:
X3 = [[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]]
the shape of the tensor can be expressed as (2, 2, 3). That is, three parameters express the tensor as a three-dimensional tensor whose size in the first dimension is 2, size in the second dimension is 2, and size in the third dimension is 3. When tensor data is stored in a memory, its shape cannot be determined from the data address (or storage region) alone, and related information such as the interrelation among multiple pieces of tensor data cannot be determined either, which results in low access efficiency of the processor to the tensor data.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, N being a positive integer, e.g., N = 1, 2, or 3, or even zero. The three-dimensional tensor in the above example can be represented by the descriptor (2, 2, 3). It should be noted that the present disclosure does not limit the way a descriptor indicates the tensor shape.
In one possible implementation, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
Although tensor data can be multidimensional, the layout of memory is always one-dimensional, so there is a correspondence between a tensor and its storage in memory. Tensor data is typically allocated in a contiguous memory space, i.e., the tensor data can be one-dimensionally expanded (e.g., row-first) and stored in memory.
This relationship between the tensor and the underlying storage may be represented by an offset of a dimension (offset), a size of a dimension (size), a step size of a dimension (stride), and so on. The offset of a dimension refers to the offset in that dimension from a reference position. The size of a dimension refers to the size of the dimension, i.e., the number of elements in the dimension. The step size of a dimension refers to the interval between adjacent elements in the dimension, for example, the step size of the above three-dimensional tensor is (6,3,1), that is, the step size of the first dimension is 6, the step size of the second dimension is 3, and the step size of the third dimension is 1.
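As a concrete illustration of this correspondence, the following sketch (Python; the function names are illustrative, not part of the disclosure) derives the row-first step sizes of the three-dimensional tensor above and maps a coordinate to its offset in one-dimensional storage:

```python
def row_major_strides(shape):
    # The last dimension has stride 1; each earlier dimension's stride is
    # the product of all later dimension sizes.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

def linear_offset(coord, strides):
    # Map an N-dimensional coordinate to its offset in the 1-D storage.
    return sum(c * s for c, s in zip(coord, strides))

assert row_major_strides((2, 2, 3)) == (6, 3, 1)   # matches the example above
assert linear_offset((1, 0, 2), (6, 3, 1)) == 8    # element with value 9 in X3
```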
FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1A, the data storage space 21 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), the size in the Y-axis direction (the total number of rows) is ori_y (not shown), and the starting address PA_start (base address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is partial data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the data block 23 is defined using a descriptor, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 may then be determined from the size ori_x of the data storage space 21 along the X axis and the size ori_y along the Y axis, together with the offset offset_y of the data block 23 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following formula (1):
X direction: ori_x, offset_x, size_x
Y direction: ori_y, offset_y, size_y        (1)
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation manner, a reference address of the data reference point of the descriptor in the data storage space may be appointed, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions of at least two vertexes located at diagonal positions in the N dimensional directions relative to the data reference point.
For example, a reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For example, one datum (say, the datum at position (2, 2)) may be selected in the data storage space 21 as the data reference point, and its physical address in the data storage space used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1A can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example using the diagonal vertices in the top-left-to-bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following equation (2):

X direction: x_min, x_max
Y direction: y_min, y_max        (2)
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional spatial data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (3):

f(x, y, z)        (3)
in one possible implementation, the descriptor is further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (4):
X direction: ori_x, offset_x, size_x
Y direction: ori_y, offset_y, size_y
PA        (4)
where PA is the address parameter. The address parameter may be a logical address or a physical address. When the descriptor is parsed, PA may be taken as any one of a vertex, a midpoint, or another preset point of the tensor shape, and the corresponding data address can be obtained by combining it with the shape parameters in the X and Y directions.
In one possible implementation, the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address comprises a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be the following equation (5):
X direction: ori_x, offset_x, size_x
Y direction: ori_y, offset_y, size_y
PA_start        (5)
wherein PA_start is the reference address parameter, which is not described again here.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In a possible implementation manner, a default base address can be set in a task, the base address is used by descriptors in instructions in the task, and shape parameters based on the base address can be included in the descriptor contents. This base address may be determined by setting an environmental parameter for the task. The relevant description and usage of the base address can be found in the above embodiments. In this implementation, the contents of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the mode of setting a common reference address by using the environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.
For example, suppose the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x × size_y. Then the starting data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (6):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x        (6)
the data start address PA1 determined according to the above equation (6)(x,y)In combination with the offsets offset _ x and offset _ y and the sizes size _ x and size _ y of the storage area, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.
For example, suppose the content of the descriptor in the operand is expressed by formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x × size_y, and the data description position for the descriptor included in the operand is (x_q, y_q). Then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (7):

PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)        (7)
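As a hedged sketch of how equations (6) and (7) could be evaluated from a formula-(1)-style descriptor (the field and function names below are assumptions for illustration, not the patent's interface):

```python
from dataclasses import dataclass

@dataclass
class Descriptor2D:
    pa_start: int   # base/reference address of the data storage space
    ori_x: int      # total number of columns of the storage space
    offset_x: int   # offset of the data block in the X direction
    offset_y: int   # offset of the data block in the Y direction
    size_x: int     # block size in the X direction
    size_y: int     # block size in the Y direction

def start_address(d):
    # Equation (6): starting data address PA1(x,y) of the tensor data.
    return d.pa_start + (d.offset_y - 1) * d.ori_x + d.offset_x

def element_address(d, x_q, y_q):
    # Equation (7): address PA2(x,y) of the element at data description
    # position (x_q, y_q) within the block.
    return d.pa_start + (d.offset_y + y_q - 1) * d.ori_x + (d.offset_x + x_q)
```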
in one possible implementation, the descriptor may indicate the data of the block. The data partitioning can effectively accelerate the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data partitioning for fast arithmetic processing.
FIG. 1B shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the disclosure. As shown in FIG. 1B, the data storage space 26 also stores two-dimensional data in a row-first manner, which may be represented by (X, Y) (where the X-axis is horizontally to the right and the Y-axis is vertically down). The dimension in the X-axis direction (the dimension of each row, or the total number of columns) is ori _ X (not shown), and the dimension in the Y-axis direction (the total number of rows) is ori _ Y (not shown). Unlike the tensor data of fig. 1A, the tensor data stored in fig. 1B includes a plurality of data blocks.
In this case, the descriptor requires more parameters to represent the data blocks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori_x; x.tile.size (the size of a tile, 27 in the figure); x.tile.stride (the tile step, 28 in the figure, i.e., the distance between the first point of the first tile and the first point of the second tile); x.tile.num (the number of tiles; FIG. 1B shows 3 tiles); and x.stride (the overall step, i.e., the distance from the first point of the first row to the first point of the second row). Other dimensions may similarly include corresponding parameters.
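These tiling parameters might be grouped per dimension as in the following sketch (a speculative rendering: the dotted names such as x.tile.size become plain fields, and the offset computation is one plausible reading of the parameters, not a formula from the patent):

```python
from dataclasses import dataclass

@dataclass
class TiledDim:
    ori: int          # ori_x: total size of this dimension
    tile_size: int    # x.tile.size: size of one tile in this dimension
    tile_stride: int  # x.tile.stride: distance between the first points of
                      # two consecutive tiles in this dimension
    tile_num: int     # x.tile.num: number of tiles in this dimension
    stride: int       # x.stride: overall step, e.g. first point of one row
                      # to the first point of the next row

def tile_element_offset(dim, tile_idx, elem_idx):
    # Offset, along this dimension, of element elem_idx inside tile tile_idx.
    assert tile_idx < dim.tile_num and elem_idx < dim.tile_size
    return tile_idx * dim.tile_stride + elem_idx
```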
In one possible implementation, the descriptor may include an identification of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be its number; the content of the descriptor may include at least one shape parameter representing a shape of the tensor data. For example, the tensor data is 3-dimensional data, of three dimensions of the tensor data, in which shape parameters of two dimensions are fixed, the content of the descriptor thereof may include a shape parameter representing another dimension of the tensor data.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, separate data storage spaces may be divided for tensor data, each of which has a one-to-one correspondence with descriptors at the start address of the data storage space. In this case, a circuit or module (e.g., an entity external to the disclosed computing device) responsible for parsing the computation instruction may determine the data address in the data storage space of the data corresponding to the operand from the descriptor.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may be further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor may further include at least one address parameter indicating the address of the tensor data. For example, the tensor data is 3-dimensional data, when the descriptor points to an address of the tensor data, the content of the descriptor may include one address parameter indicating the address of the tensor data, such as a starting physical address of the tensor data, or may include a plurality of address parameters of the address of the tensor data, such as a starting address of the tensor data + an address offset, or the tensor data is based on the address parameters of each dimension. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein, the reference address can be different according to the change of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the reference address may comprise a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of that data in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one direction of the N dimensional directions, the size of the storage area in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
FIG. 2 shows a schematic block diagram of a processing device according to an embodiment of the present disclosure. As shown in fig. 2, the processing device 200 includes a control module 210, an operation module 220, and a storage module 230.
The control module 210 may be configured to control the operation of the processing device 200, such as reading instructions from memory or instructions arriving from outside, decoding the instructions, issuing micro-operation control signals to the corresponding components, and the like. Specifically, the control module 210 may be configured to control the operation module 220 to perform the corresponding processing according to the received instruction. The instructions may include, but are not limited to, data access instructions, arithmetic instructions, descriptor management instructions, synchronization instructions, and the like. The present disclosure does not limit the specific type of instruction or the specific manner of decoding.
The decoded instruction includes an opcode and an operand. When an instruction involves processing tensor data, at least one operand of the instruction may include at least one descriptor that indicates at least one of the following information: shape information of tensor data and spatial information of tensor data.
The operation module 220 is configured to execute a specific instruction or operation under the control of the control module 210. The operation module 220 may include, but is not limited to, an arithmetic logic unit (ALU), a memory access unit (MAU), a neural functional unit (NFU), and the like. The present disclosure does not limit the specific hardware type of the operation module.
The storage module 230 may be configured to store various information including, but not limited to, instructions, descriptor-associated information, tensor data, and the like. The memory module 230 may include various memory resources including, but not limited to, internal memory and external memory. The internal memory may include, for example, registers, on-chip SRAM, or other media cache. The external memory may comprise, for example, off-chip memory. The present disclosure is not limited to a particular implementation of the memory module.
Alternatively or additionally, the processing device 200 may further include a Tensor Interface Unit (TIU) 240. The tensor interface unit 240 can be configured to implement operations associated with the descriptors under the control of the control module 210. These operations may include, but are not limited to, registration, modification, deregistration, resolution of descriptors; reading and writing descriptor content, etc. The present disclosure does not limit the specific hardware type of tensor interface unit. In this way, the operation associated with the descriptor can be realized by dedicated hardware, and the access efficiency of tensor data is further improved.
In some embodiments of the present disclosure, tensor interface unit 240 may be configured to parse descriptors included in operands of instructions. For example, the tensor interface unit may parse shape information of tensor data included in the descriptor to determine a data address of data corresponding to the operand in the data storage space.
Although control module 210 and tensor interface unit 240 are shown in fig. 2 as two separate modules, those skilled in the art will appreciate that these two modules/units may also be implemented as one module or more modules, and the present disclosure is not limited in this respect.
The processing device 200 may be implemented using a general-purpose processor (e.g., a central processing unit CPU or a graphics processing unit GPU) and/or a special-purpose processor (e.g., an artificial intelligence processor, a scientific computing processor, or a digital signal processor), and the present disclosure does not limit the specific type of processing device.
When hardware executes instructions in parallel, if there is a dependency relationship between the instructions executed in parallel, an error in the execution result may be caused. For example, if two instructions executing in parallel access the same memory location, and at least one of the two instructions is an instruction that writes to the memory location, then a dependency exists between the two instructions, such as a read-after-write dependency, a write-after-write dependency, or a write-after-read dependency. At this point, if the latter instruction executes before the former instruction, an execution error may result. Therefore, the order of execution of these instructions must be guaranteed to be consistent, for example, by enforcing sequential execution, i.e., a subsequent instruction must wait for a previous instruction to complete before execution.
As can be seen from the foregoing description of tensor data, tensor data is usually a multidimensional array with a large data scale, so instruction processing for tensor data usually takes longer than for scalar data. If tensor data were processed in the aforementioned strictly sequential manner, the processing time would be too long and the efficiency low. In view of this, embodiments of the present disclosure provide an operation-level instruction parallelism scheme in which the parallelism of operations is arbitrated at the granularity of fine-grained regions of the shape coordinate space of the tensor data the operations target, so that the potential for parallel execution of operations can be exploited. Therefore, according to the embodiments of the present disclosure, hardware can preserve a consistent execution order during parallel execution while improving the degree of parallelism of operations, thereby ensuring both the accuracy and the efficiency of processing.
FIG. 3A illustrates an exemplary flow chart of a processing method 300 according to an embodiment of the disclosure. The processing method 300 may be implemented, for example, by the processing device 200 of fig. 2.
As shown in FIG. 3A, the method 300 begins at step S301 with obtaining a first operation of an instruction. This step may be performed, for example, by the control module 210 of FIG. 2. In some embodiments, the first operation is an operation on tensor data whose shape coordinate space includes at least one fine-grained region. In some embodiments, a fine-grained region may include one or more adjacent coordinate points of the shape coordinate space of the tensor data. A fine-grained region is the smallest unit on which an operation acts.
It should be noted that the operations involved in the present disclosure may be basic operations supported by processor hardware, or may be microinstructions (e.g., request signals, etc.) that parse the basic operations. The present disclosure is not limited to a particular type of operation. The processing device of the present disclosure may execute two operations in parallel, or may execute two or more operations in parallel, and the number of operations executed in parallel is not limited in the present disclosure. Two operations executed in parallel may belong to the same instruction or may belong to different instructions, and the present disclosure is not limited in this respect.
When hardware executes instructions in parallel, the processor can execute multiple operations in parallel. To avoid access conflicts, when several operations executed in parallel are all directed at the same data, the processor executes only one of them and blocks the others, which reduces processor efficiency. In the embodiments of the present disclosure, the shape coordinate space of the processed tensor data is further divided into multiple fine-grained regions, and whether operations can be executed in parallel is determined at the granularity of these regions, which can greatly improve the efficiency of the processor.
In some embodiments, the shape, size, and/or number of fine-grained regions may be determined based at least in part on at least one of: the computing power of the hardware; the bandwidth of the hardware; and the size of the shape coordinate space of the tensor data. The hardware computing power may be the amount of data that the hardware processes in parallel in a computing cycle, and the hardware bandwidth may be the data transfer capacity, e.g., the amount of data transferred per unit time.
For example, the processor to which the processing method of the embodiments of the present disclosure is applied has a hardware computation capability of processing 100 bits of data in parallel in one computation cycle, a hardware bandwidth is 200 bits of data transmitted in a unit time, and for two-dimensional tensor data with a size of 100 × 100 bits, a shape coordinate space of the tensor data may be divided into 100 fine-grained regions according to the hardware computation capability, where each fine-grained region includes 100 bits of data; the shape coordinate space may also be divided into 50 fine-grained regions according to hardware bandwidth, where each fine-grained region includes 200 bits of data.
It should be understood that the hardware computing power and hardware bandwidth may vary according to the hardware of the processor, and the present disclosure does not limit the hardware computing power and hardware bandwidth. By the method, the size and/or the number of the fine-grained regions can be determined according to the processing capacity (hardware computing capacity and/or hardware bandwidth) of the processor, so that the division result of the fine-grained regions better meets the requirements of different hardware use environments, the operation executed according to the fine-grained regions tends to be synchronous with the processing capacity of the processor, the execution efficiency of the hardware can be exerted as much as possible, and the processing efficiency of the processor is improved.
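The worked numbers above can be reproduced with a minimal sketch (assuming, as in the example, that the region size in bits is chosen to match either the per-cycle compute width or the per-unit-time bandwidth):

```python
def num_fine_grained_regions(total_bits, region_bits):
    # Split the tensor's shape coordinate space into regions of region_bits
    # each; region_bits can be chosen from the hardware's computing power
    # (bits processed per cycle) or its bandwidth (bits per unit time).
    assert total_bits % region_bits == 0
    return total_bits // region_bits

tensor_bits = 100 * 100                             # 100 x 100-bit 2-D tensor
print(num_fine_grained_regions(tensor_bits, 100))   # by compute power -> 100
print(num_fine_grained_regions(tensor_bits, 200))   # by bandwidth     -> 50
```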
The fine-grained regions may be the same or different in shape and size. For example, a first operation may carry a first fine-grained shape and size (the coordinate-point layout and count of each fine-grained region), e.g., a square of 8 × 8 = 64 coordinate points (assuming a two-dimensional tensor), while a second operation carries a second fine-grained shape and size, e.g., a square of 16 × 16 = 256 coordinate points. That is, when the first operation is performed, each square of 8 × 8 = 64 coordinate points is regarded as a fine-grained region, and when the second operation is performed, each square of 16 × 16 = 256 coordinate points is regarded as a fine-grained region. Likewise, a first operation may carry a first fine-grained number (e.g., set to 4) while a second operation carries a second fine-grained number (e.g., set to 8); that is, when the first operation is performed, the shape coordinate space is divided into 4 fine-grained regions, and when the second operation is performed, it is divided into 8 fine-grained regions. It is understood that an operation can also simultaneously carry the fine-grained shape, size, and number parameters. The shape, size, and/or number of fine-grained regions may be determined as desired, and are not limited by the present disclosure.
Continuing with fig. 3A, in step S302, it is determined whether there is a second operation on the tensor data in progress.
As mentioned earlier, when the instruction involves processing of tensor data, descriptors are included in the operands by which information relating to the tensor data can be obtained. Thus, in some embodiments, spatial information of the tensor data (e.g., spatial identification ID) from which the dependencies between instructions can be determined can be included in the descriptor. Since different spatial IDs represent the spatial regions pointed to without dependency. Therefore, whether two instructions have dependency relationship, that is, whether to operate on the same tensor data, can be quickly judged according to whether the spatial IDs of the tensor data processed by the two instructions are the same.
In other embodiments, whether there is an ongoing second operation on the tensor data may be determined from the occupation state of the data storage area corresponding to the tensor data. For example, the processor may determine whether the data storage area of the tensor data is occupied by querying an occupation state list; if it is occupied, it is determined that there is a second operation being performed on the tensor data. The occupation state list may be preset and stored in memory, or may be generated before the processor starts to execute a certain task and deregistered after the task is completed. Whenever the occupation state of a data storage area changes, the processor updates the occupation state list to record the occupation state of each data storage area. The present disclosure does not limit the manner of determining whether there is an ongoing second operation on one or more pieces of tensor data.
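A minimal sketch of such an occupation state list follows (the keying by space ID and the method names are assumptions for illustration; the patent only requires that the occupation state of each data storage area be recorded and updated):

```python
class OccupationStateList:
    # Records, per data storage area (keyed here by its space ID), the
    # operation currently in progress on that area, if any.
    def __init__(self):
        self._occupied = {}   # space_id -> ongoing operation

    def acquire(self, space_id, operation):
        # Mark the area as occupied when an operation starts on it.
        self._occupied[space_id] = operation

    def release(self, space_id):
        # Mark the area as available when the operation completes.
        self._occupied.pop(space_id, None)

    def ongoing_operation(self, space_id):
        # Returns the ongoing (second) operation on this tensor data, or None.
        return self._occupied.get(space_id)
```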
Next, in step S303, when there is such a second operation, it is determined whether there is an overlap of a first fine-grained region to which the first operation is currently directed and a second fine-grained region to which the second operation is currently directed.
The first fine-grained region and the second fine-grained region may each be any fine-grained region among the multiple fine-grained regions in the shape coordinate space of the tensor data. It is understood that an operation on tensor data is an operation on each fine-grained region in the shape coordinate space of that tensor data. For example, assume the tensor data A is a two-dimensional matrix of 8 rows and 16 columns, its shape coordinate space is two-dimensional, and every 2 rows by 4 columns form a fine-grained region, so that the shape coordinate space of the tensor data includes 16 fine-grained regions. A write operation on the tensor data A can then be regarded as a write operation on these 16 fine-grained regions. Execution may proceed as follows: the 1st fine-grained region (rows 1-2, columns 1-4) is written first; after it completes, the 2nd fine-grained region (rows 1-2, columns 5-8) is written; after that, the 3rd fine-grained region (rows 1-2, columns 9-12); and so on until the 16th fine-grained region (rows 7-8, columns 13-16) is written, completing the write operation on the tensor data A. Those skilled in the art will understand that an operation may also act on multiple fine-grained regions at a time, e.g., writing two or more fine-grained regions at a time, until all regions have been operated on.
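For the 8-row, 16-column example above, the mapping from a coordinate to its fine-grained region number could look like the following sketch (zero-based coordinates, regions numbered from 1 as in the example):

```python
ROWS_PER_REGION, COLS_PER_REGION = 2, 4
REGIONS_PER_ROW_BAND = 16 // COLS_PER_REGION   # 4 regions per band of rows

def region_index(row, col):
    # Regions are numbered band by band of rows, left to right, from 1.
    band = row // ROWS_PER_REGION
    return band * REGIONS_PER_ROW_BAND + col // COLS_PER_REGION + 1

assert region_index(0, 0) == 1     # rows 1-2, columns 1-4
assert region_index(0, 4) == 2     # rows 1-2, columns 5-8
assert region_index(7, 15) == 16   # rows 7-8, columns 13-16
```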
When there is an operation on the tensor data, as the operation proceeds, the states of the fine-grained regions in the shape coordinate space of the tensor data may include an operation-completed state, an operation-in-progress state, and a not-yet-operated state; alternatively, for situations where it is not necessary to record whether an operation has completed, the states may simply include an occupied state and an available state. The state of the fine-grained region an operation is currently directed at is the operation-in-progress (or occupied) state. Thus, when there is an operation on tensor data, one fine-grained region in the shape coordinate space of that tensor data can be considered to be being operated on or occupied; that is the fine-grained region the operation is currently directed at.
In one possible implementation, the first fine-grained region to which the first operation is currently directed may include a fine-grained region to which the first operation is to be performed, for example, when the operations are initially performed, the first fine-grained region is typically specified to be performed in a predetermined order. The fine-grained region to which the first operation being executed is currently directed may also be included, and may be any one of the fine-grained regions. The second fine-grained region to which the second operation is currently directed may be a fine-grained region to which the second operation being executed is currently directed, and may be any one of the fine-grained regions.
In one possible implementation manner, when it is determined whether there is an ongoing second operation on the tensor data before the first operation performs an operation on the tensor data, the first fine-grained region to which the first operation is currently directed is a fine-grained region to which the first operation is to be performed. For example, before a first operation performs an operation on tensor data, the first fine-grained region to which the first operation is currently directed is typically the first fine-grained region of the shape coordinate space of the tensor data. At this time, the first operation has not yet performed an operation on the first fine-grained region. The second fine-grained region to which the ongoing second operation is currently directed may be associated with a process of execution of the second operation. The second fine-grained region may also be the first fine-grained region of the shape coordinate space of the tensor data if the second operation also just started to be performed. At this time, the first fine-grained region overlaps the second fine-grained region. If the second operation has completed the operation of the first fine-grained region, and the second fine-grained region currently targeted is the pth fine-grained region (P is an integer greater than 1), the first fine-grained region and the second fine-grained region are not overlapped.
In a possible implementation manner, when it is determined whether there is an ongoing second operation on the tensor data in the operation process of the first operation on the tensor data, a first fine-grained region may be determined according to an execution progress of the first operation, a second fine-grained region may be determined according to an execution progress of the second operation, and then it is determined whether the first fine-grained region and the second fine-grained region overlap.
In a possible implementation, if the execution beats of the operations are consistent, it suffices to determine, only before the first operation begins operating on the tensor data, whether there is an ongoing second operation on the tensor data and whether the first and second fine-grained regions overlap. Here, consistent beats means that, with fine-grained regions of the same size, the two operations take the same length of time to operate on one fine-grained region.
In a possible implementation manner, if the beats of the execution processes of the operations are inconsistent or whether the beats are consistent cannot be determined, in the process of the operation on the tensor data by the first operation, after each time the operation on the currently targeted first fine-grained region is completed, whether a second operation on the tensor data is in progress is continuously determined, and whether the first fine-grained region and the second fine-grained region are overlapped or not is continuously determined, so as to determine whether the first operation can be continuously performed or not.
In a possible implementation manner, whether a first fine-grained region currently targeted by a first operation overlaps with a second fine-grained region currently targeted by a second operation may be determined according to a coordinate address, a pointer position, a fine-grained region identifier, and the like. For example, the coordinate address of the current tensor data of each operation may be recorded, a first fine-grained region currently targeted by the first operation and a second fine-grained region currently targeted by the second operation are respectively determined according to the current coordinate address of the first operation and the current coordinate address of the second operation, and the correspondence between the coordinate addresses and the fine-grained regions, and whether the first fine-grained region and the second fine-grained region overlap is further determined. The coordinate addresses and the fine-grained regions are defined based on the shape coordinate space of the tensor data, so that after the fine-grained division of the shape coordinate space is known, the corresponding fine-grained regions can be directly determined from the coordinate addresses. As another example, a pointer may be set for each operation, the pointer pointing to the fine-grained region to which the operation is currently directed. And respectively determining a first fine-grained region currently aimed at by the first operation and a second fine-grained region currently aimed at by the second operation according to the pointer position of the first operation and the pointer position of the second operation, and further judging whether the first fine-grained region and the second fine-grained region are overlapped. For another example, an identifier may be set for each fine-grained region, and whether the first fine-grained region and the second fine-grained region overlap or not may be determined by recording the identifier of the fine-grained region currently targeted by the operation. The indicia may comprise any combination of letters, numbers or symbols. Whether the first fine-grained region and the second fine-grained region overlap can also be judged in other manners, and the judgment basis of whether the first fine-grained region and the second fine-grained region overlap is not limited in the present disclosure.
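Whatever bookkeeping is used, the overlap test reduces to comparing the coordinate ranges of the two current regions; a sketch follows (assuming each region is described by per-dimension half-open coordinate ranges, which also covers operations that use different fine-grained sizes):

```python
def ranges_overlap(a_start, a_end, b_start, b_end):
    # Half-open coordinate ranges [start, end) along one dimension.
    return a_start < b_end and b_start < a_end

def regions_overlap(first_region, second_region):
    # Each region is a tuple of per-dimension (start, end) coordinate
    # ranges in the shape coordinate space; two regions overlap only if
    # their ranges overlap in every dimension.
    return all(ranges_overlap(a0, a1, b0, b1)
               for (a0, a1), (b0, b1) in zip(first_region, second_region))

# First operation at rows 0-2, cols 0-4; second at rows 0-2, cols 4-8:
assert not regions_overlap(((0, 2), (0, 4)), ((0, 2), (4, 8)))
```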
Next, in step S304, when there is no overlap between the first fine-grained region and the second fine-grained region, the first operation is performed.
In a possible implementation manner, if the first fine-grained region currently targeted by the first operation does not overlap with the second fine-grained region currently targeted by the second operation, the first fine-grained region is either a fine-grained region on which the second operation has already completed its operation or a fine-grained region that the second operation does not need to operate on. In this case, executing the first operation affects neither the operation process nor the operation result of the second operation, and the first operation may be executed.
According to the present embodiment, when the shape coordinate space of the tensor data targeted by the first operation includes at least one fine-grained region and there is an ongoing second operation on the tensor data, it can be determined whether the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation overlap, and the first operation is executed when they do not. The first operation can thus proceed whenever the fine-grained regions currently targeted by the two operations do not overlap, allowing the first operation and the second operation to operate on the same tensor data simultaneously and improving the processing efficiency of the processor.
In one possible implementation, the processing method 300 may further include: the first operation is blocked when the first fine-grained region overlaps the second fine-grained region.
In one possible implementation, the first fine-grained region overlapping the second fine-grained region includes the first fine-grained region completely or partially overlapping the second fine-grained region. When the two regions overlap, executing the first operation would operate on the overlapping portion, which may interfere with the execution of the second operation and make its operation result inaccurate, and may likewise make the operation result of the first operation inaccurate. In this case the first operation may be blocked, that is, its execution suspended, and it may be executed after the second operation completes its operation on the second fine-grained region it currently targets, i.e., once the first fine-grained region no longer overlaps the second fine-grained region.
In this embodiment, when the first fine-grained region and the second fine-grained region overlap, the first operation is blocked, so that operation errors and inaccurate operation results caused by the overlap of the fine-grained regions of the operations can be avoided, and the correctness of the operations is ensured.
In some embodiments, at least one of the first operation and the second operation may be a write operation. That is, when the operations on the target data form a read-after-write (the second operation is a write and the first a read), write-after-read (the second operation is a read and the first a write), or write-after-write (both operations are writes) pattern, there is a dependency between the two operations, and the method of the embodiments of the present disclosure may be adopted. In these embodiments, by dividing the shape coordinate space of the target data into one or more fine-grained regions and executing operations in units of fine-grained regions, write-after-read, read-after-write, and write-after-write operations can be executed correctly with accurate results, while the waiting time between operations is reduced and the execution efficiency of the processor is improved.
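The dependency condition above reduces to a one-line predicate; a minimal sketch (hypothetical names) is shown below.

    def needs_ordering_check(first_is_write, second_is_write):
        # Read-after-write, write-after-read and write-after-write all involve
        # at least one write; two reads never conflict.
        return first_is_write or second_is_write

    assert needs_ordering_check(False, True)        # read after write
    assert needs_ordering_check(True, False)        # write after read
    assert needs_ordering_check(True, True)         # write after write
    assert not needs_ordering_check(False, False)   # read after read: no check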
In the disclosed embodiments, based on this fine-grained division of the shape coordinate space of tensor data, there is also provided a processing method that determines the execution range of an operation from coordinate space ranges expressed in terms of fine-grained regions.
FIG. 3B schematically illustrates an exemplary flow diagram of a processing method according to an embodiment of the disclosure. Likewise, the processing method of fig. 3B may be implemented by, for example, the processing device 200 of fig. 2.
As shown in fig. 3B, in step S311, a first coordinate space range of the tensor data that the first operation is allowed to use is determined. This step may be performed, for example, by the control module 210 of fig. 2. The first coordinate space range may be, for example, a portion of the shape coordinate space of the tensor data involved in the first operation.
Next, in step S312, a second coordinate space range of the tensor data to be used when the first operation is performed is determined. This step may be performed, for example, by the operation module 220 of fig. 2. The second coordinate space range may likewise be a portion of the shape coordinate space of the tensor data involved in the first operation.
Finally, in step S313, the first operation is performed within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. This step may be performed, for example, by the operation module 220 of fig. 2.
The first coordinate space range, the second coordinate space range, and the third coordinate space range may all be expressed based on a fine-grained region in the shape coordinate space of the tensor data, that is, the first, second, and third coordinate space ranges may be characterized in units of fine-grained regions.
In the embodiments of the present disclosure, by limiting the coordinate space range of the tensor data that an operation may use when executed, for example limiting the operation to the third coordinate space range as above, it can be ensured that, when instructions execute in parallel, their accesses within each coordinate space range remain in order, thereby guaranteeing both the correctness and the efficiency of processing. Furthermore, since software-side programming generally uses spatial coordinates to refer to data points or data blocks within tensor data, constraining the parallel execution of operations by the coordinate space range of the tensor data simplifies software-side programming and facilitates instruction execution.
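As an illustration of steps S311-S313, the sketch below represents each coordinate space range as a set of fine-grained region identifiers, one of the characterizations described later; the concrete region numbers follow the fig. 4 example discussed further below, and all names are hypothetical.

    def third_range(first_range, second_range):
        # The range actually usable by the first operation is the intersection
        # of what it is allowed to use (first coordinate space range) and what
        # it is expected to use (second coordinate space range).
        return first_range & second_range

    allowed  = {4001, 4002, 4003, 4005, 4006, 4007}  # first coordinate space range
    expected = set(range(4003, 4013))                # second coordinate space range
    assert third_range(allowed, expected) == {4003, 4005, 4006, 4007}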
In some embodiments, the overlap determination of the fine-grained regions described above in connection with fig. 3A and 3B may be performed only under certain conditions, thereby reducing determination time and speeding up instruction processing.
FIG. 3C schematically illustrates an exemplary flow chart of a processing method according to another embodiment of the disclosure.
As shown in fig. 3C, in step S321, a first operation of an instruction is acquired. In some embodiments, the first operation is an operation on tensor data, and its operands may include a descriptor of the tensor data.
Next, in step S322, it is determined whether or not there is a second operation being performed on the tensor data. This operation is similar to step S302 described above in connection with fig. 3A and is not repeated here.
If it is determined that no such second operation exists, the method may jump directly to step S326, i.e. perform the first operation. This means that there is no second operation that may conflict with the first operation, and thus the first operation can be performed immediately. When there are other operations being performed, the first operation is now performed in parallel with these other operations.
If it is determined that there is such a second operation, i.e., a conflict may occur, the method may proceed to step S323, where it is further determined whether the data operation ranges of the first operation and the second operation overlap. It will be appreciated that, because tensor data is typically large in dimension, the range of data operations for which different operations are directed may be different. When there is no overlap between the data operation ranges of different operations, the first operation can be executed in parallel with the preceding second operation without conflict.
Whether the data operation ranges of the operations overlap may be determined in a variety of ways. In some embodiments, this may be determined based on spatial information and/or shape information of the tensor data to be operated on; the spatial information and shape information of tensor data are described in detail elsewhere herein and are not repeated here. The shape information of the tensor data can be used to determine the access addresses of an operation, so as to determine whether the data operation ranges of two operations overlap. An access address may be a coordinate space address of the tensor data or a storage space address of the tensor data, and the present disclosure is not limited in this respect.
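As one plausible realization of this check, when the access addresses of each operation form a single contiguous interval (a simplifying assumption made here; real access patterns may be strided), the static range comparison is an interval-overlap test:

    def ranges_overlap(start_a, end_a, start_b, end_b):
        # Half-open address intervals [start, end) overlap iff each interval
        # begins before the other one ends.
        return start_a < end_b and start_b < end_a

    # Two operations on the same tensor touching disjoint address ranges may
    # run in parallel; overlapping ranges require the dynamic check below.
    assert not ranges_overlap(0, 1024, 1024, 2048)
    assert ranges_overlap(0, 1024, 512, 2048)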
If it is determined in step S323 that the data operation ranges of the first operation and the second operation do not overlap, the method may jump to step S326, i.e., perform the first operation. This means that even if the first operation and the second operation access the same tensor data (determined at step S322), the first operation may be performed in parallel with the second operation as long as the data operation ranges of the first operation and the second operation do not overlap, that is, the mutually non-overlapping portions of the same tensor data are accessed, respectively.
If it is determined in step S323 that the data operation ranges of the first operation and the second operation overlap, the method may proceed to step S324, where it is further determined whether the fine-grained regions for which the first operation and the second operation are currently directed overlap. The specific determination method can refer to the description above with reference to fig. 3A and 3B.
If it is determined in step S324 that the fine-grained regions currently targeted by the first operation and the second operation do not overlap, the method may proceed to step S326, i.e., perform the first operation. In this way, whether the current fine-grained regions overlap can be judged dynamically as the operations execute, so that parallel execution is achieved at the level of fine-grained regions and the parallel potential of the operations is exploited to the fullest.
If it is determined in step S324 that the fine-grained regions currently targeted by the two operations overlap, the first operation cannot be executed at this time, as doing so might cause a conflict. Therefore, in step S325, the first operation is blocked.
In the embodiment of fig. 3C, a static pre-determination based on the data operation ranges of the operations is performed first, and the dynamic overlap determination on fine-grained regions is performed only under the specific condition that the data operation ranges overlap. This effectively shortens determination time and speeds up instruction processing.
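Putting steps S321-S326 together, the dispatch decision of fig. 3C might be sketched as follows; the operation representation (plain dictionaries with a tensor reference, an address range, and the current fine-grained region) is a hypothetical illustration, not the disclosed hardware interface.

    def dispatch_first_operation(first, second):
        # first, second: dicts with keys 'tensor' (object identity), 'start'/'end'
        # (data operation range as a half-open address interval) and 'region'
        # (fine-grained region currently targeted). second is the ongoing
        # second operation, or None if there is none.
        if second is None or second['tensor'] is not first['tensor']:
            return "execute"   # S322 -> S326: no second operation on this tensor
        disjoint = (first['end'] <= second['start']
                    or second['end'] <= first['start'])
        if disjoint:
            return "execute"   # S323 -> S326: data operation ranges do not overlap
        if first['region'] != second['region']:
            return "execute"   # S324 -> S326: current fine-grained regions differ
        return "block"         # S324 -> S325: overlap, block the first operation

    t = object()
    op1 = {'tensor': t, 'start': 0, 'end': 64, 'region': 4001}
    op2 = {'tensor': t, 'start': 32, 'end': 96, 'region': 4002}
    assert dispatch_first_operation(op1, op2) == "execute"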
The present disclosure also provides exemplary processing devices for implementing the processing methods of fig. 3A, 3B, and 3C. Fig. 3D shows a schematic functional block diagram of a processing device according to an embodiment of the present disclosure.
As shown in fig. 3D, the processing device 30 includes an operation acquisition unit 31, a first determination unit 32, a second determination unit 33, and an execution unit 34.
The operation acquisition unit 31 is configured to acquire a first operation of an instruction. The first operation is an operation on tensor data whose shape coordinate space includes at least one fine-grained region, each fine-grained region including one or more adjacent coordinate points of the shape coordinate space. The first determination unit 32 is configured to determine whether there is an ongoing second operation on the tensor data. The second determination unit 33 is configured to determine, when such a second operation exists, whether there is an overlap between the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation. The execution unit 34 is configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.
In some embodiments, the second determining unit 33 may include a first determining subunit 331 and a second determining subunit 332. The first determining subunit 331 is configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use. The second determining subunit 332 is configured to determine a second coordinate space range of the tensor data to be used when the first operation is performed. In these embodiments, the execution unit 34 may be configured to perform the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range, where the first, second, and third coordinate space ranges are characterized in units of fine-grained regions in the shape coordinate space of the tensor data.
In some embodiments, the processing device 30 may further comprise a blocking unit 35 and a third determining unit 36. The blocking unit 35 may be configured to block the first operation, so as to prevent a conflict, when it is determined that the fine-grained region currently targeted by the first operation overlaps the fine-grained region currently targeted by the second operation.
In some embodiments, the third determination unit 36 may be configured to make a preliminary static determination, i.e., determine whether the data operation ranges of the first operation and the second operation overlap. The judgment of the second determining unit 33 is made only when the data operation ranges overlap. The execution unit 34 may execute the first operation according to the judgment results of the respective determination units.
Those skilled in the art will appreciate that the various elements shown in fig. 3D are divided according to functional implementation. This division is merely exemplary, and in actual implementation, two or more functions may be implemented in the same hardware unit, and one function may also be distributed in two hardware units. For example, in one implementation, the operation acquiring unit 31 and the first determining unit 32 and the optional third determining unit 36 may be included in the control module 210 of the processing apparatus 200 shown in fig. 2, and the second determining unit 33 and the executing unit 34 may be included in the operation module 220 of the processing apparatus 200. For another example, in another implementation, the operation acquiring unit 31, the first determining unit 32, the second determining unit 33 and the optional third determining unit 36 may be included in the control module 210 of the processing apparatus 200 shown in fig. 2, and the executing unit 34 is included in the operation module 220 of the processing apparatus 200.
It should also be understood that the units comprised in the processing device 30 correspond to the respective steps in the method described with reference to fig. 3A, 3B and 3C. Thus, the operations and features described above for the method are equally applicable to the processing device 30 and the units included therein, and are not described in detail here.
FIG. 4 schematically illustrates the division of a coordinate space range according to an embodiment of the disclosure. Fig. 4 is an exemplary illustration of two-dimensional data, however, one skilled in the art will appreciate that the same approach may be similarly applied to tensor data of three or even more dimensions.
As shown in fig. 4, the shape coordinate space 400 of the two-dimensional tensor data is divided into 12 fine-grained regions 4001, 4002, …, 4011, and 4012. Within each fine-grained region, accesses are guaranteed to be in order. Any data element (e.g., data point) of the tensor data may be represented by two-dimensional spatial coordinates (X, Y) (where the X-axis points horizontally to the right and the Y-axis vertically down). Evidently, the coordinates of any data element of the tensor data do not exceed the maximum size of the shape coordinate space.
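For illustration, assuming the 12 regions form a grid of 4 columns by 3 rows numbered row by row from 4001 (a reading consistent with the examples below), and assuming a hypothetical region size of 16 x 16 coordinate points, a coordinate (X, Y) maps to its region number as follows.

    COLS, ROWS = 4, 3            # fig. 4: 12 fine-grained regions in a 4 x 3 grid
    GRAIN_X, GRAIN_Y = 16, 16    # hypothetical region size in coordinate points

    def region_id(x, y):
        # Row-major numbering starting at 4001; X increases to the right,
        # Y increases downward.
        col = x // GRAIN_X
        row = y // GRAIN_Y
        assert 0 <= col < COLS and 0 <= row < ROWS, "coordinate outside shape space"
        return 4001 + row * COLS + col

    assert region_id(0, 0) == 4001                        # top-left region
    assert region_id(3 * GRAIN_X, 2 * GRAIN_Y) == 4012    # bottom-right region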
In some embodiments, all fine-grained regions in the shape coordinate space of the tensor data that are not currently used by a previous operation associated with the first operation may be determined as the first coordinate space range.
In these embodiments, for example, when the previous operation is using the fine-grained regions 4004, 4008, and 4009-4012, the spatial range that the first operation is allowed to use at this time (i.e., the first coordinate space range) may include the fine-grained regions 4001-4003 and 4005-4007, the region shown by diagonal hatching.
Alternatively or additionally, in some embodiments, the fine-grained regions determined based on the coordinates of the tensor data to be accessed by the first operation may be taken as the second coordinate space range.
In these embodiments, for example, when it is expected that the first operation will use the fine-grained regions other than the fine-grained regions 4001 and 4002 (estimated, for example, from the coordinates of the tensor data to be accessed), the spatial range to be used when the first operation is performed (i.e., the second coordinate space range) may be determined as the fine-grained regions 4003-4012, the region shown by dot filling.
Then, according to the embodiment of the present disclosure, the range within which the first operation can actually be performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. As shown in fig. 4, in the current example, the third coordinate space range is the area where the diagonal hatching and the dot filling both exist, that is, the fine-grained regions 4003 and 4005-4007 in fig. 4.
In some embodiments, the first, second, and third coordinate space ranges may be directly characterized using the identifiers of the fine-grained regions they include. For example, in the example shown in fig. 4, the first coordinate space range may be characterized using the identifiers of the fine-grained regions 4001-4003 and 4005-4007; the second coordinate space range may be characterized using the identifiers of the fine-grained regions 4003-4012; and the third coordinate space range may be characterized using the identifiers of the fine-grained regions 4003 and 4005-4007.
In most cases, however, tensor data is accessed by traversing the data units at the coordinate points from front to back along a certain dimension, with the coordinates gradually increasing. Thus, in other embodiments, the first coordinate space range may be characterized using an upper coordinate bound, on one or more dimensions of the tensor data, of the fine-grained regions that the first operation is allowed to use; and/or the second coordinate space range may be characterized using a lower coordinate bound, on one or more dimensions of the tensor data, of the fine-grained regions expected to be used by the first operation. By exploiting this property of tensor data being accessed in dimension order, only an upper or lower coordinate bound is needed to characterize the first or second coordinate space range, which simplifies the control information and the corresponding control method.
Still taking fig. 4 as an example, as previously described, the first coordinate space range may be characterized using the fine-grained regions in which the upper coordinate bounds, on one or more dimensions of the tensor data, of the range the first operation is allowed to use are located. For example, when the previous operation is using the rightmost column and the bottommost row, a total of 6 fine-grained regions, the spatial range that the first operation is allowed to use (i.e., the first coordinate space range) includes the two rows and three columns at the top left, a total of 6 fine-grained regions, as indicated by the diagonal hatching. In fig. 4, this first coordinate space range may be characterized by the fine-grained regions in which the X upper bound 411 on the X-axis and the Y upper bound 421 on the Y-axis are located. In this example, the first coordinate space range may be characterized using the fine-grained regions 4003 and 4005, indicating that the data coordinates accessed by the first operation cannot exceed fine-grained region 4003 in the X dimension and cannot exceed fine-grained region 4005 in the Y dimension.
Similarly, the second coordinate space range may be characterized using the fine-grained regions in which the lower coordinate bounds, on one or more dimensions of the tensor data, of the range expected to be used by the first operation are located. For example, when it is determined from the coordinates of the tensor data to be accessed that the first operation will use the fine-grained regions other than the left two fine-grained regions of the first row, the second coordinate space range may be determined to include the remaining 10 fine-grained regions, as shown by the dot-filled portion. In fig. 4, this second coordinate space range may be characterized by the fine-grained regions in which the X lower bound 412 on the X-axis and the Y lower bound 422 on the Y-axis are located. In this example, the second coordinate space range may be characterized using the fine-grained regions 4002 and 4001, indicating that data of the tensor data lying below fine-grained region 4002 in the X dimension and below fine-grained region 4001 in the Y dimension is not accessed when the first operation is performed.
The range within which the first operation can actually be performed is the third coordinate space range, the intersection of the first coordinate space range and the second coordinate space range. As shown in fig. 4, in the present example, the third coordinate space range is the area where the diagonal hatching and the dot filling both exist, that is, the "inverted L-shaped" area in fig. 4.
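Under the same 4-by-3 reading of fig. 4, one plausible semantics for this bound-based characterization is sketched below: the first range keeps regions at or before the upper-bound column and row, the second range excludes regions at or before both lower bounds, and their intersection reproduces the inverted-L region. The helper names and the exact bound semantics are assumptions for illustration.

    COLS, ROWS = 4, 3  # fig. 4: 12 fine-grained regions, 1-based columns and rows

    def in_first_range(col, row, x_up, y_up):
        # Accesses may not exceed the upper-bound regions in either dimension.
        return col <= x_up and row <= y_up

    def in_second_range(col, row, x_low, y_low):
        # Regions at or before both lower-bound regions are not accessed.
        return not (col <= x_low and row <= y_low)

    third = sorted(
        4001 + (row - 1) * COLS + (col - 1)
        for row in range(1, ROWS + 1)
        for col in range(1, COLS + 1)
        if in_first_range(col, row, x_up=3, y_up=2)       # regions 4003 and 4005
        and in_second_range(col, row, x_low=2, y_low=1)   # regions 4002 and 4001
    )
    assert third == [4003, 4005, 4006, 4007]  # the "inverted L" of fig. 4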
The first coordinate space range and the second coordinate space range may be determined in a number of ways.
In some embodiments, the first coordinate space range may be determined by additionally taking into account, for example, at least one of: the order of the operations; the operands involved in the operations; and the second coordinate space range of a previous operation. For example, in an embodiment in which a coordinate space range is characterized by coordinate space upper and lower bounds, the coordinate space lower bound of the tensor data used by a preceding operation or instruction may be used as the coordinate space upper bound of the tensor data usable by the current new instruction.
In one example, when the first operation (i.e., the current operation) is a read operation, the coordinate space upper bound is the coordinate space lower bound of the nearest (i.e., immediately preceding) write operation on the tensor data.
In another example, when the first operation is a write operation, the coordinate space upper bound is the minimum of the coordinate space lower bound of the nearest write operation on the tensor data and the coordinate space lower bounds of all read operations on the tensor data between that write operation and the current write operation. Choosing the minimum ensures that executing the first operation does not affect the execution of any of the preceding operations.
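The two rules above amount to a small helper; in this sketch (hypothetical names) the bounds are given as fine-grained region indices, smaller meaning earlier in traversal order.

    def coordinate_space_upper_bound(first_is_write, last_write_lower,
                                     read_lowers_since_last_write):
        # For a read, the bound is the lower bound of the nearest write.
        if not first_is_write:
            return last_write_lower
        # For a write, also stay behind every intervening read.
        return min([last_write_lower] + list(read_lowers_since_last_write))

    assert coordinate_space_upper_bound(False, 7, [5, 6]) == 7   # read operation
    assert coordinate_space_upper_bound(True, 7, [5, 6]) == 5    # write operation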
Alternatively or additionally, the second coordinate space range may be determined based on at least one of: the execution range of the operation; the access mode of the operation; and the current execution state of the operation. For example, in an embodiment in which a coordinate space range is characterized by coordinate space upper and lower bounds, the second coordinate space range may be determined by weighing the above factors together, ensuring that when the tensor data is accessed along a dimension, the coordinate on that dimension is not smaller than the coordinate space lower bound. Further, making the coordinate space lower bound as large as possible leaves a larger accessible range for subsequent operations or instructions.
In one example, when the access mode of the first operation is sequential, continuous access, the coordinate space lower bound may be determined based on the minimum access coordinate of the first operation; for example, the coordinate space lower bound may be set to the fine-grained region in which the minimum access coordinate is located. As shown in fig. 4, when the first operation accesses data along the X dimension, assuming the minimum X coordinate of the accessed data is A and falls within the 2nd fine-grained region from the left, the X lower bound may be set to that 2nd fine-grained region; when the first operation accesses data along the Y dimension, assuming the minimum Y coordinate of the accessed data is B and falls within the 3rd fine-grained region from the top, the Y lower bound may be set to that 3rd fine-grained region.
In another example, when the access mode of the first operation is regular, patterned access, the coordinate space lower bound may be determined from that pattern. For example, a convolution operation may access the tensor data block by block, and the coordinate space lower bound may then be determined according to the blocking pattern of the convolution operation.
In yet another example, when the access mode of the first operation cannot be determined, the coordinate space lower bound may be determined based on a predetermined setting; for example, it may take a default value such as 0, or the size of one or more fine-grained regions.
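The three cases above (sequential, regular, and undeterminable access) can be sketched as follows; grain, the fine-grained region size along the traversed dimension, and the mode labels are illustrative assumptions.

    def coordinate_space_lower_bound(access_mode, grain,
                                     min_access_coord=None, block_start=None):
        if access_mode == "sequential":
            # Fine-grained region (0-based index) containing the minimum
            # access coordinate of the operation.
            return min_access_coord // grain
        if access_mode == "regular":
            # E.g. block-wise access in a convolution: start of the first block.
            return block_start // grain
        # Access mode unknown: fall back to a predetermined default, e.g. 0.
        return 0

    assert coordinate_space_lower_bound("sequential", 16, min_access_coord=35) == 2
    assert coordinate_space_lower_bound("unknown", 16) == 0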
In some embodiments, the first and second coordinate space ranges may be determined based on a pre-partitioning of the shape coordinate space of the tensor data. Specifically, the shape coordinate space of the tensor data may first be divided into several spatial blocks, e.g., uniformly or non-uniformly in each dimension, each spatial block including one or more fine-grained regions. For example, still referring to fig. 4, the shape coordinate space of the tensor data is divided into 6 spatial blocks A, B, C, D, E and F, each of which includes 2 fine-grained regions.
In these embodiments, the spatial blocks in the shape coordinate space of the tensor data on which a previous operation has been completed may be determined as the first coordinate space range; and the spatial blocks determined based on the coordinates of the tensor data to be accessed by the first operation may be determined as the second coordinate space range.
For example, when a previous operation has completed accessing spatial blocks A and B while spatial block C is being used, the spatial range that the first operation is permitted to use at this time (i.e., the first coordinate space range) may include spatial blocks A and B. Further, when it is expected that the first operation will use spatial blocks A and B (for example, estimated from the coordinates of the tensor data to be accessed), the spatial range to be used when performing the first operation (i.e., the second coordinate space range) may be determined as spatial blocks A and B.
Then, according to the embodiment of the present disclosure, the range within which the first operation can actually be performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. In the current example, the third coordinate space range is spatial blocks A and B.
Alternatively or additionally, in some embodiments, within the third coordinate space range, the first operation may be performed based on at least one of the following orders: a predetermined spatial block order; and/or a predetermined fine-grained region order.
In some implementations, after the shape coordinate space of the tensor data to be operated on is pre-partitioned into blocks, e.g., the six spatial blocks of fig. 4, a spatial block order, i.e., an order of operation over the spatial blocks of the coordinate space, may be predetermined, e.g., the order of spatial blocks A, B, C, D, E and F. In this case, if the operands or the used space of two instructions having a dependency relationship cover the entire tensor data, the instructions can be made to operate on the spatial blocks one by one in this order. For example, assuming that the preceding instruction 1 writes the tensor data and the following instruction 2 reads it, instruction 1 may write spatial block A first and then write spatial block B; at this point, instruction 2 may be allowed to begin reading spatial block A. If the spatial blocks are divided such that the execution beats of instruction 2 are consistent with those of instruction 1, then subsequently, when instruction 1 starts writing spatial block C, instruction 2 has also finished reading spatial block A and starts reading spatial block B, and so on. The division into spatial blocks thus facilitates the parallel execution of instructions, and the agreed spatial block order helps simplify operation scheduling, shorten processing time, and improve processing efficiency.
Alternatively or additionally, in some implementations, the first operation may also be performed in a predetermined fine-grained region order when performed within a single spatial block. When the operation range of the instructions executed in parallel is further controlled based on the fine-grained region of the current operation in a single spatial block, the manner of executing according to the predetermined fine-grained region sequence is beneficial to simplifying operation scheduling, and the principle of the method is similar to the principle of executing according to the predetermined spatial block sequence described above, and is not described again here.
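The pipelined execution over an agreed spatial block order can be illustrated with a small simulation; the one-block-per-beat schedule and the reader trailing the writer by exactly one block are assumptions matching the instruction 1 / instruction 2 example above.

    BLOCKS = ["A", "B", "C", "D", "E", "F"]   # predetermined spatial block order

    def pipeline(blocks):
        # Beat-by-beat schedule: instruction 1 (write) leads, and instruction 2
        # (read) may start on a block as soon as that block has been written.
        schedule = []
        for beat in range(len(blocks) + 1):
            writing = blocks[beat] if beat < len(blocks) else None
            reading = blocks[beat - 1] if beat >= 1 else None
            schedule.append((writing, reading))
        return schedule

    # Beat 0: write A; beat 1: write B while reading A; and so on.
    assert pipeline(BLOCKS)[1] == ("B", "A")
    assert pipeline(BLOCKS)[6] == (None, "F")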
In still other embodiments, the first and second coordinate space ranges may also be determined by combining dynamic determination with a pre-partitioning of the shape coordinate space of the tensor data. Specifically, the shape coordinate space of the tensor data may first be divided into spatial blocks, e.g., uniformly or non-uniformly in each dimension. Then, within each spatial block, the first and second coordinate space ranges may be dynamically determined based on the execution of the operation; for the specific determination, reference may be made to the foregoing description, which is not repeated here. In these implementations, when the precise location of the second coordinate space range within a spatial block cannot be determined, the range may default to the whole of that spatial block.
In some embodiments, the pre-partitioning of the shape coordinate space of the tensor data may be based on at least one of: the processing capability of the hardware; preset parameters; and the size of the shape coordinate space of the tensor data. The processing capability of the hardware may include, for example but not limited to, the bit width of the data that the hardware can process. Dividing the shape coordinate space of the tensor data based on this data bit width allows the processing capability of the hardware to be fully exploited, improving parallel processing efficiency. The preset parameters may directly specify, for example, the number of spatial blocks to be divided or the size of each dimension of a spatial block. The shape coordinate space of the tensor data may also be divided based on its own size and dimensionality; for example, when the tensor data is a two-dimensional matrix of M rows by N columns (M and N being positive integers), the rows and the columns may each be evenly divided into several parts, yielding a corresponding grid of spatial blocks.
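A sketch of how such a pre-partitioning might be derived (all heuristics here are hypothetical): explicitly preset parameters take precedence, otherwise the block size is derived from the data bit width the hardware can process per beat.

    def plan_partition(shape, elem_bits, hw_bits_per_beat, preset=None):
        # shape: per-dimension sizes of the shape coordinate space.
        # preset: optional per-dimension block counts given as parameters.
        if preset is not None:
            return tuple(max(1, s // p) for s, p in zip(shape, preset))
        # Otherwise size the innermost dimension of a block so that one block
        # row matches the hardware data path; keep full extent elsewhere.
        inner = max(1, hw_bits_per_beat // elem_bits)
        return tuple(list(shape[:-1]) + [min(shape[-1], inner)])

    # 2-D tensor of 64 x 128 elements, 16-bit data, 512-bit data path:
    assert plan_partition((64, 128), 16, 512) == (64, 32)
    # Preset: split the two dimensions into 2 x 4 blocks:
    assert plan_partition((64, 128), 16, 512, preset=(2, 4)) == (32, 32)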
Although fig. 4 shows six evenly divided spatial blocks, the space may also be divided into other numbers of spatial blocks of unequal size, and the present disclosure does not limit the particular manner of division. The above describes schemes for constraining the spatial range actually used by operations, so as to ensure the order consistency of data processing and improve parallel processing efficiency when the hardware executes operations in parallel. Those skilled in the art will appreciate that the current operation (e.g., the aforementioned first operation) and the prior operation (or preceding operation) may be operations in different instructions executed in parallel, or may be different operations executed in parallel within the same instruction, and the disclosure is not limited in this respect.
The processing method performed by the processing apparatus of the embodiment of the present disclosure has been described above with reference to the flowchart. As can be understood by those skilled in the art, because the operations executed in parallel are restricted based on the coordinate space range of the processed data, the order consistency of the operation execution is ensured, and the parallelism degree of the operations is improved, so that the processing efficiency is improved. It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It is further noted that, although the steps in the method flowcharts are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Fig. 5 is a block diagram illustrating a combined processing device 500 according to an embodiment of the present disclosure. As shown in fig. 5, the combined processing device 500 includes a computing processing device 502, an interface device 504, other processing devices 506, and a storage device 508. Depending on the application scenario, one or more computing devices 510 may be included in the computing processing device, and may be configured as the processing device 200 shown in fig. 2 to perform the operations described herein in conjunction with fig. 4.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), and artificial intelligence processors. These processors may include, but are not limited to, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, and discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure alone may be considered to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be considered to form a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices can serve as an interface between the computing processing device of the present disclosure (which can be embodied as a computing device associated with artificial intelligence, e.g., neural network operations) and external data and control, performing basic control including, but not limited to, data handling and the starting and/or stopping of the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write it into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from a storage device of the computing processing device and transmit it to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example, data that cannot be fully retained in the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 602 shown in fig. 6). In one implementation, the chip is a System on Chip (SoC) integrating one or more combined processing devices as shown in fig. 5. The chip may be connected to other associated components through an external interface device, such as the external interface device 606 shown in fig. 6. The associated component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the chip. In some embodiments, the present disclosure also discloses a board card including the chip package structure. The board card is described in detail below with reference to fig. 6.
Fig. 6 is a schematic diagram illustrating the structure of a board card 600 according to an embodiment of the disclosure. As shown in fig. 6, the board card includes a memory device 604 for storing data, which includes one or more memory cells 610. The memory device may be connected to, and exchange data with, the control device 608 and the above-described chip 602, for example, via a bus. Further, the board card includes an external interface device 606 configured to relay or transfer data between the chip (or the chip in the chip package structure) and an external device 612 (such as a server or a computer). For example, data to be processed may be transferred to the chip by the external device through the external interface device; for another example, the computation result of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different interface forms, for example a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in an application scenario, the control device may include a single-chip microcomputer (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 5 and 6, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The memory may include, but is not limited to, a USB disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. A processing method, the method comprising:
obtaining a first operation of an instruction, the first operation being an operation on tensor data, a shape coordinate space of the tensor data comprising at least one fine-grained region, the fine-grained region comprising one or more neighboring coordinate points of the shape coordinate space;
determining whether there is an ongoing second operation on the tensor data;
when the second operation exists, determining whether a first fine-grained region currently aimed at by the first operation and a second fine-grained region currently aimed at by the second operation overlap; and
performing the first operation when the first fine-grained region and the second fine-grained region do not overlap.
Clause 2. the method of clause 1, wherein the method further comprises:
blocking the first operation when there is overlap of the first fine-grained region and the second fine-grained region.
Clause 3. the method of any of clauses 1-2, wherein the method further comprises:
determining whether a data operation range of the first operation overlaps with a data operation range of the second operation;
when the data operation ranges overlap, performing the determination whether there is overlap between a first fine-grained region to which the first operation is currently directed and a second fine-grained region to which the second operation is currently directed; and
and when the data operation ranges do not overlap, executing the first operation.
Clause 4. the method of clause 3, wherein determining whether the data operating range of the first operation overlaps the data operating range of the second operation is based on at least one of:
spatial information of tensor data to be operated; and/or
Shape information of tensor data to be operated on.
Clause 5. the method of any of clauses 1-4, wherein the method further comprises:
determining a first coordinate space range of the tensor data that allows use of the first operation;
determining a second coordinate space range of the tensor data to be used when performing the first operation; and
performing the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range;
wherein the first, second, and third coordinate space ranges are characterized using the fine-grained region.
Clause 6. the method of clause 5, wherein the first coordinate space range is determined based on at least one of:
the sequence of the operations;
operands involved in the operation;
a second coordinate space range of a previous operation; and
a predetermined division of a shape coordinate space of the tensor data.
Clause 7. the method of any of clauses 5-6, wherein the second coordinate space range is determined based on at least one of:
the execution range of the operation;
an access mode of operation;
a current execution state of the operation; and
a predetermined division of a shape coordinate space of the tensor data.
Clause 8. the method of any of clauses 5-7, wherein:
determining the first coordinate space range comprises: determining an upper bound of a coordinate space of one or more dimensions of the tensor data that the first operation is allowed to use; and/or
Determining the second coordinate space range comprises: determining a lower bound of a coordinate space for one or more dimensions of the tensor data expected to be used by the first operation.
Clause 9. the method of any of clauses 1-8, wherein the size and/or number of fine-grained regions is determined based at least in part on at least one of:
the computing power of the hardware;
the bandwidth of the hardware; and
a size of a shape coordinate space of the tensor data.
Clause 10. the method of any of clauses 1-9, wherein at least one of the first operation and the second operation is a write operation.
Clause 11. the method of any of clauses 1-10, wherein:
the first operation and the second operation are respectively operations in different instructions executed in parallel; or
The first operation and the second operation are respectively different operations executed in parallel in the same instruction.
Clause 12. a processing apparatus, comprising:
an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data whose shape coordinate space includes at least one fine-grained region including one or more adjacent coordinate points of the shape coordinate space;
a first determination unit configured to determine whether there is an ongoing second operation for the tensor data;
a second determining unit, configured to determine, when the second operation exists, whether there is an overlap between a first fine-grained region currently targeted by the first operation and a second fine-grained region currently targeted by the second operation; and
an execution unit configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.
Clause 13. the processing device of clause 12, wherein the processing device further comprises:
a blocking unit configured to block the first operation when there is an overlap of the first fine-grained region and the second fine-grained region.
Clause 14. the processing apparatus according to any of clauses 12-13, wherein the processing apparatus further comprises:
a third determination unit configured to determine whether a data operation range of the first operation overlaps with a data operation range of the second operation; and is
The second determination unit is configured to perform the determination whether there is an overlap between a first fine-grained region to which the first operation is currently directed and a second fine-grained region to which the second operation is currently directed, when the third determination unit determines that the data operation ranges overlap; and
the execution unit is configured to execute the first operation when the third determination unit determines that the data operation ranges do not overlap.
Clause 15. the processing apparatus according to clause 14, wherein the third determining unit determines whether the data operation range of the first operation and the data operation range of the second operation overlap based on at least one of:
spatial information of tensor data to be operated; and/or
Shape information of tensor data to be operated on.
Clause 16. the processing apparatus according to any of clauses 12-15, wherein the second determining unit further comprises:
a first determination subunit configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use; and
a second determination subunit configured to determine a second coordinate space range of the tensor data to be used when the first operation is performed;
wherein the execution unit is further configured to execute the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range,
and wherein the first, second, and third coordinate space ranges are characterized using the fine-grained region.
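Since clause 16 requires all three coordinate space ranges to be characterized using the fine-grained region, one natural illustration (hypothetical, not the disclosed encoding) represents each range as a set of region indices, so the third range is literally a set intersection:

```python
# Illustrative reading of clause 16: coordinate space ranges expressed as sets
# of fine-grained region indices; the executable range is their intersection.

allowed    = set(range(0, 6))     # first range: regions the operation may use
expected   = set(range(4, 10))    # second range: regions the operation will touch
executable = allowed & expected   # third range

print(sorted(executable))         # [4, 5]: the operation runs over these regions
```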
Clause 17. The processing apparatus of clause 16, wherein the first coordinate space range is determined based on at least one of:
the sequence of the operations;
operands involved in the operation;
a second coordinate space range of a prior operation; and
a predetermined division of a shape coordinate space of the tensor data.
Clause 18. The processing apparatus according to any of clauses 16-17, wherein the second coordinate space range is determined based on at least one of:
the execution range of the operation;
an access mode of the operation;
a current execution state of the operation; and
a predetermined division of a shape coordinate space of the tensor data.
Clause 19. The processing apparatus according to any of clauses 16-18, wherein:
the first determination subunit is further configured to determine an upper bound of a coordinate space of one or more dimensions of the tensor data that the first operation is allowed to use; and/or
the second determination subunit is further configured to determine a lower bound of a coordinate space of one or more dimensions of the tensor data expected to be used by the first operation.
Clause 20. The processing apparatus of any of clauses 12-19, wherein the size and/or number of fine-grained regions is determined based at least in part on at least one of:
the computing power of the hardware;
the bandwidth of the hardware; and
a size of a shape coordinate space of the tensor data.
Clause 21. The processing apparatus of any of clauses 12-20, wherein at least one of the first operation and the second operation is a write operation.
Clause 22. The processing apparatus of any of clauses 12-21, wherein:
the first operation and the second operation are respectively operations in different instructions executed in parallel; or
the first operation and the second operation are respectively different operations executed in parallel in the same instruction.
Clause 23. A chip, characterized in that it comprises a processing device according to any of clauses 12-22.
Clause 24. A card, wherein the card comprises the chip of clause 23.

Claims (24)

1. A method of processing, the method comprising:
obtaining a first operation of an instruction, the first operation being an operation on tensor data, a shape coordinate space of the tensor data comprising at least one fine-grained region, the fine-grained region comprising one or more neighboring coordinate points of the shape coordinate space;
determining whether there is an ongoing second operation on the tensor data;
when the second operation exists, determining whether a first fine-grained region currently targeted by the first operation and a second fine-grained region currently targeted by the second operation overlap; and
performing the first operation when the first fine-grained region and the second fine-grained region do not overlap.
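As a concrete illustration of claim 1 (a sketch under assumed conventions, not the claimed implementation), a fine-grained region can be obtained by tiling the shape coordinate space into fixed-size blocks of adjacent coordinate points, so the overlap test reduces to comparing region indices:

```python
# Illustrative tiling for claim 1: each fine-grained region is a fixed-size
# block of adjacent coordinate points in the shape coordinate space.

def region_of(point: tuple[int, int], tile: tuple[int, int]) -> tuple[int, int]:
    """Map a coordinate point to the index of the region (tile) containing it."""
    return point[0] // tile[0], point[1] // tile[1]

TILE = (4, 4)                   # each region is a 4x4 block of coordinates
first_op_point = (5, 2)         # where the first operation currently works
second_op_point = (5, 3)        # where the second operation currently works

# Same region -> overlap -> the first operation does not execute yet.
print(region_of(first_op_point, TILE) == region_of(second_op_point, TILE))  # True
print(region_of((5, 2), TILE) == region_of((5, 7), TILE))                   # False
```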
2. The method of claim 1, wherein the method further comprises:
blocking the first operation when there is overlap of the first fine-grained region and the second fine-grained region.
3. The method according to any of claims 1-2, wherein the method further comprises:
determining whether a data operation range of the first operation overlaps with a data operation range of the second operation;
when the data operation ranges overlap, performing the determination of whether there is an overlap between the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation; and
when the data operation ranges do not overlap, executing the first operation.
4. The method of claim 3, wherein determining whether the data operation range of the first operation overlaps the data operation range of the second operation is based on at least one of:
spatial information of the tensor data to be operated on; and/or
shape information of the tensor data to be operated on.
5. The method according to any one of claims 1-4, wherein the method further comprises:
determining a first coordinate space range of the tensor data that the first operation is allowed to use;
determining a second coordinate space range of the tensor data to be used when performing the first operation; and
performing the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range;
wherein the first, second, and third coordinate space ranges are characterized using the fine-grained region.
6. The method of claim 5, wherein the first coordinate space range is determined based on at least one of:
the sequence of the operations;
operands involved in the operation;
a second coordinate space range of a prior operation; and
a predetermined division of a shape coordinate space of the tensor data.
7. The method according to any of claims 5-6, wherein the second coordinate space range is determined based on at least one of:
the execution range of the operation;
an access mode of the operation;
a current execution state of the operation; and
a predetermined division of a shape coordinate space of the tensor data.
8. The method of any of claims 5-7, wherein:
determining the first coordinate space range comprises: determining an upper bound of a coordinate space of one or more dimensions of the tensor data that the first operation is allowed to use; and/or
determining the second coordinate space range comprises: determining a lower bound of a coordinate space of one or more dimensions of the tensor data expected to be used by the first operation.
9. The method of any of claims 1-8, wherein the size and/or number of fine-grained regions is determined based at least in part on at least one of:
the computing power of the hardware;
the bandwidth of the hardware; and
a size of a shape coordinate space of the tensor data.
10. The method of any of claims 1-9, wherein at least one of the first operation and the second operation is a write operation.
11. The method of any of claims 1-10, wherein:
the first operation and the second operation are respectively operations in different instructions executed in parallel; or
the first operation and the second operation are respectively different operations executed in parallel in the same instruction.
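Claim 11 is where the region-level check pays off: a write and a read, whether from two parallel instructions or from one instruction, can proceed concurrently as long as they stay in different regions. A toy simulation under these assumptions (the one-region offset is an invention of this example):

```python
# Toy simulation of claim 11: a producer (write) and a consumer (read) on the
# same tensor run in parallel, the consumer trailing by one fine-grained
# region, instead of the read waiting for the entire tensor to be written.

NUM_REGIONS = 4
written = [False] * NUM_REGIONS

schedule = []
for step in range(NUM_REGIONS + 1):
    if step < NUM_REGIONS:                  # producer writes region `step`
        written[step] = True
        schedule.append(f"t{step}: write region {step}")
    if step >= 1 and written[step - 1]:     # consumer reads one region behind
        schedule.append(f"t{step}: read  region {step - 1}")

print("\n".join(schedule))   # write/read pairs interleave from t1 onward
```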
12. A processing apparatus, comprising:
an operation acquisition unit configured to acquire a first operation of an instruction, the first operation being an operation on tensor data whose shape coordinate space includes at least one fine-grained region including one or more adjacent coordinate points of the shape coordinate space;
a first determination unit configured to determine whether there is an ongoing second operation for the tensor data;
a second determination unit configured to determine, when the second operation exists, whether there is an overlap between a first fine-grained region currently targeted by the first operation and a second fine-grained region currently targeted by the second operation; and
an execution unit configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.
13. The processing apparatus of claim 12, wherein the processing apparatus further comprises:
a blocking unit configured to block the first operation when there is an overlap of the first fine-grained region and the second fine-grained region.
14. The processing apparatus according to any one of claims 12 to 13, wherein the processing apparatus further comprises:
a third determination unit configured to determine whether a data operation range of the first operation overlaps with a data operation range of the second operation;
wherein the second determination unit is configured to perform the determination of whether there is an overlap between the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation when the third determination unit determines that the data operation ranges overlap; and
the execution unit is configured to execute the first operation when the third determination unit determines that the data operation ranges do not overlap.
15. The processing apparatus according to claim 14, wherein the third determination unit determines whether the data operation range of the first operation overlaps with the data operation range of the second operation based on at least one of:
spatial information of the tensor data to be operated on; and/or
shape information of the tensor data to be operated on.
16. The processing apparatus according to any one of claims 12-15, wherein the second determination unit further comprises:
a first determination subunit configured to determine a first coordinate space range of the tensor data that the first operation is allowed to use; and
a second determination subunit configured to determine a second coordinate space range of the tensor data to be used when the first operation is performed;
wherein the execution unit is further configured to execute the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range,
and wherein the first, second, and third coordinate space ranges are characterized using the fine-grained region.
17. The processing apparatus of claim 16, wherein the first coordinate space range is determined based on at least one of:
the sequence of the operations;
operands involved in the operation;
a second coordinate space range of a prior operation; and
a predetermined division of a shape coordinate space of the tensor data.
18. The processing apparatus according to any of claims 16-17, wherein the second coordinate space range is determined based on at least one of:
the execution range of the operation;
an access mode of the operation;
a current execution state of the operation; and
a predetermined division of a shape coordinate space of the tensor data.
19. The processing apparatus according to any one of claims 16-18, wherein:
the first determination subunit is further configured to determine an upper bound of a coordinate space of one or more dimensions of the tensor data that the first operation is allowed to use; and/or
the second determination subunit is further configured to determine a lower bound of a coordinate space of one or more dimensions of the tensor data expected to be used by the first operation.
20. The processing apparatus according to any one of claims 12-19, wherein the size and/or number of fine-grained regions is determined based at least in part on at least one of:
the computing power of the hardware;
the bandwidth of the hardware; and
a size of a shape coordinate space of the tensor data.
21. The processing apparatus according to any one of claims 12-20, wherein at least one of the first operation and the second operation is a write operation.
22. The processing apparatus according to any one of claims 12-21, wherein:
the first operation and the second operation are respectively operations in different instructions executed in parallel; or
The first operation and the second operation are respectively different operations executed in parallel in the same instruction.
23. A chip comprising a processing device according to any one of claims 12-22.
24. A card comprising the chip of claim 23.
CN202011270378.5A 2020-11-13 2020-11-13 Processing method, processing device and related product Pending CN114489799A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011270378.5A CN114489799A (en) 2020-11-13 2020-11-13 Processing method, processing device and related product
PCT/CN2021/123552 WO2022100345A1 (en) 2020-11-13 2021-10-13 Processing method, processing apparatus, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011270378.5A CN114489799A (en) 2020-11-13 2020-11-13 Processing method, processing device and related product

Publications (1)

Publication Number Publication Date
CN114489799A (en) 2022-05-13

Family

ID=81489888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011270378.5A Pending CN114489799A (en) 2020-11-13 2020-11-13 Processing method, processing device and related product

Country Status (2)

Country Link
CN (1) CN114489799A (en)
WO (1) WO2022100345A1 (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7508397B1 (en) * 2004-11-10 2009-03-24 Nvidia Corporation Rendering of disjoint and overlapping blits
CN105260554A * 2015-10-27 2016-01-20 Wuhan University GPU cluster-based multidimensional big data factorization method
CN111079917B * 2018-10-22 2023-08-11 Beijing Horizon Robotics Technology Research and Development Co., Ltd. Tensor data block access method and device
CN111857828B * 2019-04-25 2023-03-14 Anhui Cambricon Information Technology Co., Ltd. Processor operation method and device and related product
CN111401510A * 2019-09-24 2020-07-10 Shanghai Cambricon Information Technology Co., Ltd. Data processing method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816773A * 2022-06-29 2022-07-29 Zhejiang Dahua Technology Co., Ltd. Data processing method, system, electronic device and storage medium
CN114816773B * 2022-06-29 2022-09-23 Zhejiang Dahua Technology Co., Ltd. Data processing method, system, electronic device and storage medium

Also Published As

Publication number Publication date
WO2022100345A1 (en) 2022-05-19

Similar Documents

Publication Publication Date Title
CN107533751B (en) Line buffer unit for image processor
US9934153B2 (en) Patch memory system
CN104036537A (en) Multiresolution Consistent Rasterization
US20210150325A1 (en) Data processing method and apparatus, and related product
CN113836049B (en) Memory access method and electronic device
CN114580606A (en) Data processing method, data processing device, computer equipment and storage medium
WO2020200244A1 (en) Data processing method and apparatus, and related product
CN111857828B (en) Processor operation method and device and related product
WO2022100345A1 (en) Processing method, processing apparatus, and related product
US11354130B1 (en) Efficient race-condition detection
CN111782274B (en) Data processing device and related product
CN114489805A (en) Processing method, processing device and related product
CN114489803A (en) Processing device, processing method and related product
CN114281561A (en) Processing unit, synchronization method for a processing unit and corresponding product
CN114489804A (en) Processing method, processing device and related product
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
WO2022100286A1 (en) Data processing apparatus, data processing method, and related product
CN114692844A (en) Data processing device, data processing method and related product
CN114489802A (en) Data processing device, data processing method and related product
CN114489788A (en) Instruction processing device, instruction processing method and related product
CN114489789A (en) Processing device, processing method and related product
JP7266121B2 (en) Computing equipment, chips, board cards, electronic devices and computing methods
WO2022001499A1 (en) Computing apparatus, chip, board card, electronic device and computing method
US9430304B2 (en) Method and system for block scheduling control in a processor by remapping
CN111783992A (en) Data processing device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination