WO2022001500A1 - Computing apparatus, integrated circuit chip, board card, electronic device, and computing method - Google Patents


Info

Publication number
WO2022001500A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
circuit
slave
processing circuit
Application number
PCT/CN2021/095705
Other languages
French (fr)
Chinese (zh)
Inventor
陶劲桦
刘少礼
Original Assignee
上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Application filed by 上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Publication of WO2022001500A1 publication Critical patent/WO2022001500A1/en

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 9/00: Arrangements for program control, e.g. control units
                    • G06F 9/06: using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
                            • G06F 9/30003: Arrangements for executing specific machine instructions
                                • G06F 9/30007: to perform operations on data operands
                            • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
                                • G06F 9/3867: using instruction pipelines
                • G06F 15/00: Digital computers in general; Data processing equipment in general
                    • G06F 15/76: Architectures of general purpose stored program computers
                        • G06F 15/78: comprising a single central processing unit
                            • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
                                • G06F 15/781: On-chip cache; Off-chip memory

Definitions

  • the present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic device, and a computing method.
  • an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a key role in improving the performance of computing chips (e.g., processors) in the computing system.
  • Various current computing chips can complete various general or specific control operations and data processing operations by using associated instruction sets.
  • the current instruction set still has many defects.
  • the existing instruction set is limited by the hardware architecture and offers little flexibility.
  • many instructions can only complete a single operation, and executing multiple operations usually requires multiple instructions, which potentially leads to an increase in on-chip I/O data throughput.
  • the current instructions leave room for improvement in execution speed, execution efficiency, and on-chip power consumption.
  • the arithmetic instructions of conventional CPUs are designed to perform basic single-data scalar arithmetic operations.
  • a single-data operation means that each operand of the instruction is scalar data.
  • however, the operands involved are often multi-dimensional vector (i.e., tensor) data types, and using only scalar operations cannot enable the hardware to complete computing tasks efficiently. Therefore, how to perform multi-dimensional tensor operations efficiently is also an urgent problem to be solved in the current computing field.
  • the present disclosure provides a hardware architecture platform and a solution of related instructions.
  • the flexibility of the instruction can be increased, the efficiency of the instruction execution can be improved, and the calculation cost and overhead can be reduced.
  • the solution of the present disclosure supports efficient memory access and processing of tensor data on the basis of the aforementioned hardware architecture, thereby accelerating tensor operations and reducing the computational overhead of tensor operations when the calculation instructions include multi-dimensional vector operands.
  • the present disclosure discloses a computing device comprising a master processing circuit and at least one slave processing circuit, wherein: the master processing circuit is configured to perform a master arithmetic operation in response to a master instruction, and the slave processing circuit is configured to perform a slave arithmetic operation in response to a slave instruction, wherein the master arithmetic operation includes a pre-processing operation and/or a post-processing operation for the slave arithmetic operation, and the master instruction and the slave instruction are obtained by parsing the calculation instruction received by the computing device, wherein the operand of the calculation instruction includes a descriptor used to indicate the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand, wherein the master processing circuit and/or the slave processing circuit are configured to perform the respective corresponding master arithmetic operation and/or slave arithmetic operation according to the storage address.
  • the present disclosure discloses an integrated circuit chip comprising the computing device mentioned in the previous aspect and described in the various embodiments that follow.
  • the present disclosure discloses a board including the integrated circuit chip mentioned in the previous aspect and described in various embodiments later.
  • the present disclosure discloses an electronic device comprising the integrated circuit chip mentioned in the previous aspect and described in the various embodiments that follow.
  • the present disclosure discloses a method of performing a computing operation using the aforementioned computing device, wherein the computing device includes a master processing circuit and at least one slave processing circuit, the method comprising: configuring the master processing circuit to perform a master arithmetic operation in response to a master instruction, and configuring the slave processing circuit to perform a slave arithmetic operation in response to a slave instruction, wherein the master arithmetic operation includes a pre-processing operation and/or a post-processing operation for the slave arithmetic operation, and the master instruction and the slave instruction are obtained by parsing the calculation instruction received by the computing device, wherein the operand of the calculation instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand, wherein the method further includes configuring the master processing circuit and/or the slave processing circuit to perform the respective corresponding master arithmetic operation and/or slave arithmetic operation according to the storage address.
  • the master and slave instructions related to the master operation and the slave operation can be efficiently executed, thereby speeding up the execution of operations. Further, owing to the combination of the master operation and the slave operation, the computing device of the present disclosure can support more types of operations. In addition, based on the pipelined operation arrangement of the computing device of the present disclosure, computing instructions can be flexibly configured to meet computing requirements.
  • FIG. 1a is a schematic diagram illustrating a computing device according to an embodiment of the present disclosure;
  • FIG. 1b is a schematic diagram illustrating a data storage space according to an embodiment of the present disclosure;
  • FIG. 2 is a block diagram illustrating a computing device according to an embodiment of the present disclosure;
  • FIG. 3 is a block diagram illustrating a main processing circuit of a computing device according to an embodiment of the present disclosure;
  • FIGS. 4a, 4b and 4c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure;
  • FIG. 5 is a block diagram illustrating a slave processing circuit of a computing device according to an embodiment of the present disclosure;
  • FIG. 6 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • the solution of the present disclosure utilizes the hardware architecture of a master processing circuit and at least one slave processing circuit to perform associated data operations, so that relatively complex operations can be performed using relatively flexible and simplified computing instructions.
  • the solution of the present disclosure utilizes the master instruction and the slave instruction obtained by parsing the computing instruction, and causes the master processing circuit to execute the master instruction to realize the master operation, and the slave processing circuit to execute the slave instruction to realize the slave operation, so as to implement various complex operations including, for example, vector operations.
  • the master arithmetic operation may include a pre-processing operation and/or a post-processing operation with respect to the slave arithmetic operation.
  • the preprocessing operation may be, for example, a data conversion operation and/or a data concatenation operation.
  • the post-processing operation may be, for example, an arithmetic operation on the result output by the slave processing circuit.
  • the disclosed scheme utilizes the descriptor to determine the storage address of the data corresponding to the operand.
  • the master processing circuit and/or the slave processing circuit may be configured to perform the respective corresponding master operation and/or slave operation according to the storage address, wherein the master operation and/or the slave operation may involve various operations on tensor data.
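To make the division of labor concrete, the following is a minimal sketch (not the patented implementation) of the master/slave split described above: the master performs pre-processing (here, a hypothetical data-type conversion) and post-processing (here, a hypothetical accumulation), while a slave processing circuit performs the core arithmetic. All function names and operations are illustrative assumptions.

```python
# Toy model of the master/slave arithmetic split. The concrete
# operations (float conversion, elementwise multiply, accumulation)
# are invented for illustration.

def master_preprocess(data):
    # Hypothetical pre-processing: data conversion (int -> float)
    return [float(x) for x in data]

def slave_operate(a, b):
    # Hypothetical slave operation: elementwise multiply
    return [x * y for x, y in zip(a, b)]

def master_postprocess(partial):
    # Hypothetical post-processing: accumulate the slave's output
    return sum(partial)

a = master_preprocess([1, 2, 3])
b = master_preprocess([4, 5, 6])
result = master_postprocess(slave_operate(a, b))
print(result)  # 32.0
```

In this toy model the slave never sees unconverted data and the master never performs the elementwise arithmetic, mirroring the pre-processing / slave-operation / post-processing pipeline of the text.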
  • the computing instructions of the present disclosure support flexible and personalized configurations to meet different application scenarios.
  • FIG. 1a is a schematic diagram illustrating a computing device 100 according to an embodiment of the present disclosure.
  • computing device 100 may include a master processing circuit 102 and slave processing circuits, such as slave processing circuits 104, 106 and 108 shown in the figure.
  • although three slave processing circuits are shown here, those skilled in the art will understand that the computing device 100 of the present disclosure may include any suitable number of slave processing circuits, and that the multiple slave processing circuits, as well as the multiple slave processing circuits and the master processing circuit, may be connected to one another in different ways; this disclosure does not impose any limitation in this regard.
  • the multiple slave processing circuits of the present disclosure can execute various slave instructions (eg, obtained by parsing the computing instructions) in parallel, so as to improve the processing efficiency of the computing device.
  • a computing instruction may be an instruction in an instruction system of an interactive interface between software and hardware, which may be a binary or other form of machine language that is received and processed by hardware such as a processor (or processing circuit).
  • Compute instructions may include opcodes and operands for instructing the processor to operate.
  • the calculation instruction may include one or more operation codes, and when the foregoing calculation instruction includes one operation code, the operation code may be used to instruct multiple operations of the processor.
  • a compute instruction may also include one or more operands.
  • the operand may include a descriptor for indicating the shape of the tensor, and the descriptor may be used to determine the storage address of the data corresponding to the operand.
  • the master instruction and the slave instruction may be obtained by parsing the computing instruction received by the computing device.
  • the master processing circuit may be configured to perform a master arithmetic operation in response to a master instruction
  • the slave processing circuit may be configured to perform a slave arithmetic operation in response to a slave instruction.
  • the aforementioned master instruction or slave instruction may be a microinstruction or a control signal running inside the processor, and may include (or instruct) one or more operations.
  • the master processing circuit and/or the slave processing circuit may be configured to fetch the tensors according to the memory addresses obtained based on the descriptors. Through the descriptor-based memory access mechanism, the solution of the present disclosure can significantly improve the reading and storage speed of tensor data in performing tensor operations, thereby speeding up computation and reducing computation overhead.
  • the aforementioned master arithmetic operations may include pre-processing operations and/or post-processing operations for the slave arithmetic operations.
  • as for the main instruction executed by the main processing circuit, it may include, for example, a preprocessing operation of performing data conversion and/or data splicing on the data to be involved in the operation.
  • the master instruction may also include a preprocessing operation that only reads data selectively, such as reading data stored in a dedicated or private buffer and sending it to the slave processing circuit, or generating the corresponding random numbers for the operation of the slave processing circuit.
  • the main instruction may include one or more post-processing operations associated with the functions of the operators.
  • the main instruction may include various types of operations such as addition, multiplication, table lookup, comparison, averaging, filtering, etc., on the intermediate or final operation result obtained after the slave processing circuit executes the slave instruction.
  • the aforementioned intermediate operation result or final operation result may be the aforementioned tensor, and its storage address may be obtained according to the descriptor of the present disclosure.
  • the main instruction may include an identification bit for identifying the pre-processing operation and/or the post-processing operation. Therefore, when acquiring the main instruction, the main processing circuit can determine whether to perform a pre-processing operation or a post-processing operation on the operation data according to the identification bit. Additionally or alternatively, the pre-processing operation and the post-processing operation in the main instruction may be matched by a preset position (or called an instruction field segment) of the calculation instruction. For example, when a preset position including (master instruction+slave instruction) is set in the calculation instruction, it can be determined that the master instruction in the calculation instruction involves the preprocessing operation for the slave operation.
  • similarly, when a preset position including (slave instruction + master instruction) is set in the calculation instruction, it can be determined that the master instruction in the calculation instruction involves the post-processing operation for the slave operation.
  • for example, when the calculation instruction has a length of three predetermined bit widths (that is, the aforementioned preset positions), the instruction segment located in the first predetermined bit width can be designated as the master instruction for the pre-processing operation, the second segment of predetermined bit width located in the middle position is designated as the slave instruction for the slave operation, and the third segment of predetermined bit width in the last position is designated as the master instruction for the post-processing operation.
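The three fixed-width segments described above can be sketched as simple bit-field extraction. The 16-bit segment width below is an assumption for illustration only; the patent does not specify concrete widths.

```python
# Hedged sketch: splitting a calculation instruction into three
# fixed-width segments (pre-processing master instruction, slave
# instruction, post-processing master instruction). SEG_BITS = 16
# is an invented width.

SEG_BITS = 16
SEG_MASK = (1 << SEG_BITS) - 1

def split_instruction(word):
    """Return (pre_master, slave, post_master) segments of a 48-bit word."""
    pre = (word >> (2 * SEG_BITS)) & SEG_MASK   # first segment
    slv = (word >> SEG_BITS) & SEG_MASK         # middle segment
    post = word & SEG_MASK                      # last segment
    return pre, slv, post

word = (0x00A1 << 32) | (0x0B02 << 16) | 0x0C03
print(split_instruction(word))  # (161, 2818, 3075)
```

The same extraction could be driven by an identification bit, as in the alternative scheme the text mentions, rather than by fixed positions.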
  • a slave instruction to be executed by the slave processing circuit may include one or more operations associated with the functionality of one or more arithmetic circuits in the slave processing circuit.
  • the slave instruction may include an operation for performing an operation on the data after the preprocessing operation is performed by the master processing circuit.
  • the slave instruction may include various operations such as arithmetic operation, logical operation, and data type conversion.
  • the slave instruction may include performing various vector-related multiply-add operations on the data subjected to the preprocessing operation, including, for example, convolution operations.
  • the slave processing circuit may also directly perform a slave operation on the input data according to the slave instruction.
  • the master processing circuit 102 may be configured to obtain and parse the computational instructions to obtain the aforementioned master and slave instructions, and to send the slave instructions to the slave processing circuits.
  • the main processing circuit may include one or more decoding circuits (or decoders) for parsing computational instructions.
  • the master processing circuit can parse the received calculation instruction into one or more master instructions and/or slave instructions, and send the corresponding slave instructions to the slave processing circuit, so that the slave processing circuit can perform the slave arithmetic operation.
  • the slave instruction can be sent to the slave processing circuit in various ways.
  • the master processing circuitry may send slave instructions to the storage circuitry, and via the storage circuitry to slave processing circuitry.
  • the master processing circuit may broadcast the same slave instruction to the multiple slave processing circuits.
  • the computing device may further include a separate circuit, unit or module dedicated to parsing the computing instructions received by the computing device, such as the architecture described later in conjunction with FIG. 2.
  • the slave processing circuits of the present disclosure may include a plurality of arithmetic circuits for performing slave arithmetic operations, wherein the plurality of arithmetic circuits may be connected and configured to perform multi-stage pipelined arithmetic operations.
  • the operation circuit may include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a rotation circuit for performing at least vector operations.
  • the slave processing circuit can perform the multi-dimensional convolution operation in the neural network according to the slave instruction.
  • the master operation and/or the slave operation of the present disclosure may also include various types of operations on tensor data, for which the solution of the present disclosure proposes to use descriptors to obtain information about the shape of the tensor, so as to determine the storage address of the tensor data and to access and store the tensor data through the aforementioned storage address.
  • tensors can contain data organized in various forms.
  • tensors can have different dimensions: for example, a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a matrix can be regarded as a tensor of two or more dimensions.
  • the shape of a tensor includes information such as the dimension of the tensor and the size of each dimension of the tensor. For example, for a two-dimensional tensor, its shape can be described by the descriptor as (2, 4); that is, the two parameters indicate that the tensor is a two-dimensional tensor whose first dimension has a size of 2 and whose second dimension has a size of 4. It should be noted that the present application does not limit the manner in which the descriptor indicates the shape of the tensor.
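As a minimal illustration of the (2, 4) example above, a descriptor carrying only an identifier and a shape might look as follows. The class name and fields are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch: a minimal descriptor holding an identifier (used to
# distinguish descriptors) and the shape of the tensor it describes.

class Descriptor:
    def __init__(self, ident, shape):
        self.ident = ident    # identifier distinguishing this descriptor
        self.shape = shape    # (size of dim 0, size of dim 1, ...)

    @property
    def ndim(self):
        # The number of shape parameters gives the tensor's dimension.
        return len(self.shape)

d = Descriptor(ident=0, shape=(2, 4))
print(d.ndim, d.shape)  # 2 (2, 4)
```

A real descriptor would additionally carry address parameters, as discussed later in the text.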
  • the value of N may be determined according to the dimension (order) of the tensor data, and may also be set according to the usage requirements of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (eg offset, size, etc.) of the three-dimensional tensor data in the three-dimensional direction. It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.
  • the descriptor may include the identifier of the descriptor and/or the content of the descriptor.
  • the identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be numbered; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data.
  • for example, when the tensor data is 3-dimensional data and the shape parameters of two of its three dimensions are fixed, the content of the descriptor may include the shape parameter representing the other dimension of the tensor data.
  • the identifier and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, on-chip SRAM or other medium caches, and the like.
  • the tensor data indicated by the descriptor can be stored in data storage space (internal memory or external memory), such as on-chip cache or off-chip memory, etc.
  • the present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
  • the identifier and content of the descriptor, and the tensor data indicated by the descriptor, can be stored in the same area of the internal memory. For example, a continuous area of the on-chip cache, with addresses ADDR0-ADDR1023, can be used to store the related information.
  • the addresses ADDR0-ADDR63 can be used as the descriptor storage space to store the identifier and content of the descriptor
  • the addresses ADDR64-ADDR1023 can be used as the data storage space to store the tensor data indicated by the descriptor.
  • addresses ADDR0-ADDR31 can be used to store the identifier of the descriptor
  • addresses ADDR32-ADDR63 can be used to store the content of the descriptor.
  • it should be understood that the address ADDR here is not limited to one bit or one byte; it is used to represent one address, i.e., one address unit.
  • Those skilled in the art can determine the descriptor storage space, the data storage space and their specific addresses according to actual conditions, which are not limited in this disclosure.
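The example layout above (ADDR0-ADDR31 for identifiers, ADDR32-ADDR63 for contents, ADDR64-ADDR1023 for tensor data) can be sketched as a simple range check. The address units are abstract, as the text notes, and this partitioning is just the text's example, not a required layout.

```python
# Hedged sketch of the example address layout: one contiguous on-chip
# region ADDR0-ADDR1023 partitioned into identifier, content, and data
# sub-spaces.

ID_SPACE      = range(0, 32)      # descriptor identifiers
CONTENT_SPACE = range(32, 64)     # descriptor contents
DATA_SPACE    = range(64, 1024)   # tensor data indicated by descriptors

def region_of(addr):
    if addr in ID_SPACE:
        return "identifier"
    if addr in CONTENT_SPACE:
        return "content"
    if addr in DATA_SPACE:
        return "data"
    raise ValueError("address outside ADDR0-ADDR1023")

print(region_of(10), region_of(40), region_of(512))  # identifier content data
```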
  • the identifier, content of the descriptor, and tensor data indicated by the descriptor may be stored in different areas of the internal memory.
  • a register can be used as a descriptor storage space to store the identifier and content of the descriptor in the register
  • an on-chip cache can be used as a data storage space to store the tensor data indicated by the descriptor.
  • the number of the register may be used to represent the identifier of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor it stores is set to 0. When the descriptor in the register is valid, an area can be allocated in the cache space for storing the tensor data according to the size of the tensor data indicated by the descriptor.
  • the identifier and content of the descriptor may be stored in an internal memory, and the tensor data indicated by the descriptor may be stored in an external memory.
  • the identifier and content of the descriptor can be stored on-chip, and the tensor data indicated by the descriptor can be stored off-chip.
  • the data address of the data storage space corresponding to each descriptor may be a fixed address.
  • a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to a descriptor one-to-one.
  • the control circuit can determine the data address in the data storage space of the data corresponding to the operand according to the descriptor.
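When each descriptor corresponds one-to-one to a fixed starting address, resolving an operand's data address reduces to a table lookup plus an offset, as the following sketch illustrates. The table contents and function name are invented for illustration.

```python
# Hedged sketch: fixed one-to-one mapping from descriptor identifier
# to the starting address of its tensor data in the data storage space.

START_ADDR = {0: 0x1000, 1: 0x2000, 2: 0x3000}  # invented fixed bases

def operand_address(descriptor_id, offset=0):
    # The control circuit resolves the operand's data address from the
    # descriptor's fixed base plus an element offset.
    return START_ADDR[descriptor_id] + offset

print(hex(operand_address(1, offset=8)))  # 0x2008
```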
  • in a possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may also be used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor may also include at least one address parameter representing the address of the tensor data.
  • the content of the descriptor may include an address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data, It may also include multiple address parameters of the address of the tensor data, such as the start address + address offset of the tensor data, or address parameters of the tensor data based on each dimension.
  • the address parameter of the tensor data may include a reference address of the data reference point of the descriptor in the data storage space of the tensor data.
  • the reference address can differ as the data reference point changes. The present disclosure does not limit the selection of the data reference point.
  • the reference address may include a start address of the data storage space.
  • the reference address of the descriptor is the starting address of the data storage space.
  • the reference address of the descriptor is the address of the data block in the data storage space.
  • the shape parameters of the tensor data include at least one of the following: the size of the data storage space in at least one of the N dimensional directions; the size of the storage area in at least one of the N dimensional directions; the offset of the storage area in at least one of the N dimensional directions; the positions, relative to the data reference point, of at least two vertices at diagonal positions of the N dimensional directions; and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the data description position is the mapping position of the point or area in the tensor data indicated by the descriptor.
  • for example, when the tensor data is 3-dimensional, the descriptor can use three-dimensional space coordinates (x, y, z) to represent the shape of the tensor data, and the data description position of the tensor data may be the position, represented by three-dimensional space coordinates (x, y, z), of a point or an area in the three-dimensional space to which the tensor data is mapped.
  • in a possible implementation manner, the reference address of the data reference point of the descriptor in the data storage space of the tensor data, the size of the data storage space in at least one of the N dimensional directions, the size of the storage area in at least one of the N dimensional directions, and/or the offset of the storage area in at least one of the N dimensional directions may be used to determine the content of the descriptor of the tensor data.
  • Figure 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure.
  • as shown in Figure 1b, the data storage space 21 stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the starting address PA_start (reference address) of the data storage space 21 is the physical address of the first data block 22.
  • the data block 23 is part of the data in the data storage space 21; its offset 25 in the X-axis direction is represented as offset_x, its offset 24 in the Y-axis direction is represented as offset_y, its size in the X-axis direction is represented as size_x, and its size in the Y-axis direction is represented as size_y.
  • in this example, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined from a combination of the size ori_x of the data storage space 21 on the X axis, its size ori_y on the Y axis, and the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
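The row-first layout of Figure 1b makes the address of any element of data block 23 a simple arithmetic function of these descriptor parameters. The sketch below computes it; all concrete numbers are invented for illustration.

```python
# Hedged sketch of Figure 1b's layout: a row-major 2-D data storage
# space of row size ori_x starting at PA_start, containing a data block
# at offset (offset_x, offset_y) with size (size_x, size_y).

PA_start = 0x8000   # reference address: address of the first data block
ori_x    = 64       # size of each row of the data storage space
offset_x, offset_y = 4, 2   # block offsets along X and Y
size_x, size_y     = 8, 3   # block sizes along X and Y

def block_element_address(i, j):
    """Address of element (row i, column j) of the data block."""
    assert 0 <= i < size_y and 0 <= j < size_x
    # Skip (offset_y + i) full rows, then step to column offset_x + j.
    return PA_start + (offset_y + i) * ori_x + (offset_x + j)

print(hex(block_element_address(0, 0)))  # 0x8084
```

This is the kind of address computation a descriptor-based access mechanism can perform in hardware once the descriptor content is known.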
  • in this example the content of the descriptor represents a two-dimensional space, but those skilled in the art can set the specific dimension represented by the content of the descriptor according to the actual situation, which is not limited in the present disclosure.
  • in a possible implementation manner, a reference address of the data reference point of the descriptor in the data storage space may be agreed, and on the basis of the reference address, the content of the descriptor of the tensor data may be determined according to the positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimensional directions.
  • the base address PA_base of the data base point of the descriptor in the data storage space may be agreed.
  • one piece of data (for example, the data at position (2, 2)) may be selected, and the physical address of that data in the data storage space may be used as the reference address PA_base.
  • the content of the descriptor of the data block 23 in FIG. 1b can be determined according to the positions of the two diagonal vertices relative to the data reference point.
  • the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined; for example, using the diagonal vertices in the upper-left-to-lower-right direction, the relative position of the upper left vertex is (x_min, y_min) and the relative position of the lower right vertex is (x_max, y_max). The content of the descriptor of the data block 23 is then determined from the relative position (x_min, y_min) of the upper left vertex and the relative position (x_max, y_max) of the lower right vertex.
  • the following formula (2) can be used to represent the content of the descriptor (the base address is PA_base):
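Formula (2) itself is not reproduced in this extract, but the relation between the two representations can be sketched: taking the upper left vertex as the block's offset and spanning both diagonal vertices inclusively recovers the same offset/size parameters used earlier. This mapping (inclusive span, 0-based relative positions) is an assumption for illustration, not the patent's exact formula.

```python
# Illustrative derivation of offset/size shape parameters from two
# diagonal vertices relative to the data reference point. The inclusive
# "+1" span and 0-based positions are assumptions.
def descriptor_from_vertices(pa_base, x_min, y_min, x_max, y_max):
    offset_x, offset_y = x_min, y_min          # upper left vertex gives the offsets
    size_x = x_max - x_min + 1                 # inclusive span on the X axis
    size_y = y_max - y_min + 1                 # inclusive span on the Y axis
    return {"PA_base": pa_base,
            "offset_x": offset_x, "offset_y": offset_y,
            "size_x": size_x, "size_y": size_y}

d = descriptor_from_vertices(0x1000, 2, 1, 5, 3)
# size_x = 5 - 2 + 1 = 4, size_y = 3 - 1 + 1 = 3
```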
  • the reference address of the data reference point of the descriptor in the data storage space, together with the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor, may be used to determine the content of the descriptor of the tensor data.
  • the mapping relationship between the data description position and the data address can be set according to actual needs. For example, when the tensor data indicated by the descriptor is three-dimensional space data, the function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.
  • the descriptor is further used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter indicating the address of the tensor data; for example, the content of the descriptor can be:
  • PA is the address parameter.
  • the address parameter can be a logical address or a physical address.
  • the descriptor parsing circuit can take PA as any one of a vertex, a middle point or a preset point of the tensor shape, and obtain the corresponding data address in combination with the shape parameters in the X and Y directions.
  • the address parameter of the tensor data includes the reference address of the data reference point of the descriptor in the data storage space of the tensor data, and the reference address includes the starting address of the data storage space.
  • the descriptor may further include at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be:
  • PA_start is a reference address parameter, which is not repeated here.
  • mapping relationship between the data description location and the data address can be set according to the actual situation, which is not limited in the present disclosure.
  • a predetermined reference address may be set in a task, the descriptors in the instructions under this task all use the reference address, and the content of the descriptor may include shape parameters based on the reference address.
  • the reference address can be determined by setting environment parameters for this task. For the relevant description and usage of the reference address, reference may be made to the foregoing embodiments.
  • the content of the descriptor can be mapped to the data address more quickly.
  • a reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the method of setting a common reference address by using environmental parameters, each descriptor in this method can describe data more flexibly and use a larger data address space.
  • the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor.
  • the calculation of the data address is automatically completed by the hardware, and when the representation of the content of the descriptor is different, the calculation method of the data address is also different. This disclosure does not limit the specific calculation method of the data address.
  • the content of the descriptor in the operand is represented by formula (1)
  • the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively
  • the size is size_x*size_y
  • the starting data address PA1(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (4):
  • PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (4)
  • the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor and the data description location. In this way, part of the data (eg, one or more data) in the tensor data indicated by the descriptor can be processed.
  • the content of the descriptor in the operand is represented by formula (1).
  • the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y.
  • the operand includes a data description position (x_q, y_q) for the descriptor; then the data address PA2(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (5):
  • PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (5)
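The two address computations can be transcribed directly from formulas (4) and (5). The sketch below follows the formulas as given, including the "-1" term (whether the offsets are 1-based in the patent's convention is carried over from the text, not independently verified); the example values are illustrative.

```python
# Row-major address computation following formulas (4) and (5).
def pa1(pa_start, ori_x, offset_x, offset_y):
    """Starting address of the tensor data (formula (4))."""
    return pa_start + (offset_y - 1) * ori_x + offset_x

def pa2(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
    """Address of the element at data description position (x_q, y_q) (formula (5))."""
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

# With (x_q, y_q) = (0, 0), formula (5) reduces to formula (4):
assert pa2(0x1000, 8, 2, 1, 0, 0) == pa1(0x1000, 8, 2, 1)
```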
  • the computing device of the present disclosure has been described above in conjunction with FIG. 1a and FIG. 1b.
  • one computing instruction can be used to complete multiple operations, reducing the data transfers that would otherwise be required when multiple operations each need their own instruction. This alleviates the IO bottleneck of the computing device, effectively improving computing efficiency and reducing computing overhead.
  • the solution of the present disclosure can also flexibly set the parameters included in the calculation instruction according to the type of arithmetic unit configured in the master processing circuit and the function of the arithmetic circuits configured in the slave processing circuits, and, through the cooperation of the master processing circuit and the slave processing circuits, vary the type and number of operations, so that the computing device can perform various types of computing operations, thereby expanding and enriching the application scenarios of the computing device and meeting different computing requirements.
  • since the master processing circuit and the slave processing circuit can be configured to support multi-stage pipeline operations, the execution efficiency of the operators in the master processing circuit and the slave processing circuit is improved, and the calculation time is further shortened.
  • the hardware architecture shown in FIG. 1a is only exemplary and not limiting. Under the disclosure and teaching of the present disclosure, those skilled in the art can also add new circuits or devices based on the architecture to realize more functions or operations. For example, memory circuits can be added to the architecture shown in FIG. 1a to store various types of instructions and data (eg, tensor data).
  • the master processing circuit and the slave processing circuit can also be arranged in different physical or logical positions, and the two can be connected through various data interfaces or interconnection units, so that the above master arithmetic operations and slave arithmetic operations, including the various operations on the tensors described in connection with FIG. 1b, can be completed through the interaction of the two.
  • FIG. 2 is a block diagram illustrating a computing device 200 according to an embodiment of the present disclosure. It can be understood that the computing device 200 shown in FIG. 2 is a specific implementation of the computing device 100 shown in FIG. 1a, so the details of the master processing circuit and the slave processing circuit of the computing device 100 described in conjunction with FIG. 1a also apply to the computing device 200 shown in FIG. 2.
  • a computing device 200 includes a master processing circuit 202 and a plurality of slave processing circuits 204 , 206 and 208 . Since the operations of the master processing circuit and the slave processing circuit have been described in detail above with reference to FIG. 1a, they will not be repeated here.
  • computing device 200 of FIG. 2 also includes control circuitry 210 and storage circuitry 212 .
  • the control circuit may be configured to obtain and parse the calculation instruction to obtain the master and slave instructions, and to send the master instruction to the master processing circuit 202 and the slave instructions to one or more of the plurality of slave processing circuits 204, 206 and 208.
  • the control circuit may send the parsed slave instruction to the slave processing circuit through the master processing circuit, as shown in FIG. 2 .
  • the control circuit may also directly send the parsed slave instruction to the slave processing circuit.
  • the control circuit may also send slave instructions to the slave processing circuit via the storage circuit.
  • the control circuit can use the previously discussed descriptor to determine the storage address of the data corresponding to the operand, such as the starting address of the tensor, and can instruct the master processing circuit or the slave processing circuit to acquire the tensor data participating in the tensor operation from the corresponding storage address in the storage circuit 212 in order to perform the tensor operation.
  • storage circuitry 212 may store various types of computation-related data or instructions, including, for example, the tensors described above.
  • the storage circuit may store neuron or weight data related to the operation of the neural network, or store the final operation result obtained after the main processing circuit performs post-processing operations. Additionally, the storage circuit may store an intermediate result obtained after the preprocessing operation is performed by the main processing circuit, or an intermediate result obtained after the operation operation is performed by the slave processing circuit.
  • the aforementioned intermediate results can also be tensor-type data, and are read and stored through the storage address determined by the descriptor.
  • storage circuitry may be used as on-chip memory for computing device 200 to perform data read and write operations with off-chip memory, such as through a direct memory access ("DMA") interface.
  • the storage circuit may store the operation instruction obtained after parsing by the control circuit, such as a master instruction and/or a slave instruction.
  • the storage circuit is shown in a block diagram in FIG.
  • the storage circuit can be implemented as a memory including a main memory and a main buffer, wherein the main memory can be used to store related operation data such as neurons, weights and various constant terms, and the main buffer can be used to temporarily store intermediate data, such as data after the preprocessing operation and data before the post-processing operation; according to the settings, these intermediate data can be invisible to the operators.
  • the pipeline operation circuit in the main processing circuit can also perform corresponding operations by means of the mask stored in the main storage circuit.
  • the operation circuit may read a mask from the main storage circuit, and the mask may be used to indicate whether the data in the operation circuit to perform the operation operation is valid.
  • the main storage circuit can not only serve internal storage, but also support data interaction with storage devices outside the computing device of the present disclosure, such as data exchange with external storage devices through direct memory access ("DMA").
  • FIG. 3 is a block diagram illustrating a main processing circuit 300 of a computing device according to an embodiment of the present disclosure. It can be understood that the main processing circuit 300 shown in FIG. 3 is also the main processing circuit shown and described in conjunction with FIG. 1a and FIG. 2, so the descriptions of the main processing circuit made in conjunction with FIG. 1a and FIG. 2 also apply to the description of FIG. 3.
  • the main processing circuit 300 may include a data processing unit 302, a first group of pipeline operation circuits 304, a last group of pipeline operation circuits 306, and one or more groups of pipeline operation circuits located between the two groups (represented by black circles in the figure).
  • the data processing unit 302 includes a data conversion circuit 3021 and a data splicing circuit 3022 .
  • when the master operation includes a preprocessing operation for the slave operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 3021 or the data splicing circuit 3022 will perform the corresponding conversion operation or splicing operation according to the corresponding master instruction.
  • the conversion operation and the splicing operation will be described below with an example.
  • the data conversion circuit can convert the input data to a lower bit width according to the operation requirements; for example, the bit width of the output data may be 512 bits.
  • the data conversion circuit can support conversion between multiple data types with different bit widths, such as FP16 (floating-point 16-bit), FP32 (floating-point 32-bit), FIX8 (fixed-point 8-bit), FIX4 (fixed-point 4-bit), FIX16 (fixed-point 16-bit), etc.
  • the data conversion operation may be a conversion performed on the arrangement positions of the matrix elements.
  • the transformation may include, for example, matrix transposition and mirroring (described later in conjunction with Figures 4a-4c), matrix rotation according to a predetermined angle (eg, 90 degrees, 180 degrees or 270 degrees), and transformation of matrix dimensions.
  • the data splicing circuit can perform parity splicing and other operations on data blocks extracted from the data according to, for example, the bit length set in the instruction. For example, when the data is 32 bits wide, the data splicing circuit can divide the data into 8 data blocks, numbered 1 to 8, each 4 bits wide, then splice the four data blocks 1, 3, 5 and 7 together, and splice the four data blocks 2, 4, 6 and 8 together for the operation.
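The parity-splicing example above can be modelled in a few lines. This is a behavioural sketch only: the numbering of blocks from the low end of the word and the low-to-high packing order are assumptions, since the patent does not fix a bit ordering here.

```python
# Behavioural model of parity splicing: a 32-bit word is split into eight
# 4-bit blocks (block 1 at the low end -- an assumption), and the
# odd-numbered and even-numbered blocks are spliced separately.
def parity_splice(word, block_bits=4, n_blocks=8):
    mask = (1 << block_bits) - 1
    blocks = [(word >> (i * block_bits)) & mask
              for i in range(n_blocks)]           # blocks[0] is block 1

    def splice(group):
        out = 0
        for pos, b in enumerate(group):
            out |= b << (pos * block_bits)
        return out

    odd = splice(blocks[0::2])    # blocks 1, 3, 5, 7
    even = splice(blocks[1::2])   # blocks 2, 4, 6, 8
    return odd, even

# nibbles of 0x87654321 from low to high are 1,2,3,4,5,6,7,8
odd, even = parity_splice(0x87654321)   # odd -> 0x7531, even -> 0x8642
```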
  • the above-mentioned data splicing operation may also be performed on data M (for example, a vector) obtained after performing the operation. It is assumed that the data splicing circuit can first split the lower 256 bits of the even-numbered rows of the data M, with an 8-bit width as 1 unit of data, to obtain 32 even-row unit data (represented as M_2i_0 to M_2i_31 respectively). Similarly, the lower 256 bits of the odd-numbered rows of the data M can also be split, with an 8-bit width as 1 unit of data, to obtain 32 odd-row unit data (represented as M_(2i+1)_0 to M_(2i+1)_31).
  • the split 32 odd-row unit data and 32 even-row unit data are then arranged alternately, even rows first and then odd rows.
  • the even-row unit data 0 (M_2i_0) is arranged at the low end, followed in sequence by the odd-row unit data 0 (M_(2i+1)_0), the even-row unit data 1 (M_2i_1), and so on, until the 64 unit data are spliced together to form a new data with a bit width of 512 bits.
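The interleaving step above can be sketched as follows. This is a pure-Python model for illustration; placing unit 0 at the low end of the result is an assumption about bit ordering.

```python
# Behavioural model of the interleave: the low bits of an even row and an
# odd row are each split into 8-bit units, and the units are arranged
# alternately (even unit 0, odd unit 0, even unit 1, ...) into one word.
def interleave_rows(even_row, odd_row, unit_bits=8, n_units=32):
    mask = (1 << unit_bits) - 1
    out = 0
    for i in range(n_units):
        ev = (even_row >> (i * unit_bits)) & mask
        od = (odd_row >> (i * unit_bits)) & mask
        out |= ev << (2 * i * unit_bits)        # even unit at slot 2i
        out |= od << ((2 * i + 1) * unit_bits)  # odd unit at slot 2i+1
    return out

# tiny example with 2 units per row: even units 0x01,0x02; odd units 0x03,0x04
word = interleave_rows(0x0201, 0x0403, unit_bits=8, n_units=2)  # 0x04020301
```

With `n_units=32` the result is the 512-bit word described in the text (64 units of 8 bits).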
  • the data conversion circuit and the data splicing circuit in the data processing unit can be used together to perform data preprocessing more flexibly.
  • the data processing unit may perform only data conversion without data splicing, only data splicing without data conversion, or both data conversion and data splicing.
  • the data processing unit may be configured to disable the data conversion circuit and the data splicing circuit.
  • the main processing circuit may include one or more groups of multi-stage pipeline operation circuits, such as the two groups of multi-stage pipeline operation circuits 304 and 306 shown in FIG. 3, wherein each group of multi-stage pipeline operation circuits performs a multi-stage pipeline operation including the first stage to the Nth stage, and each stage may include one or more operators to perform the multi-stage pipeline operation according to the main instruction.
  • the main processing circuit of the present disclosure may be implemented as a single instruction multiple data (Single Instruction Multiple Data, SIMD) module, and each group of multi-stage pipeline operation circuits may form an operation pipeline.
  • the operation pipelines can be provided with different or the same functional modules (that is, the operators of the present disclosure), such as various types of functional modules including an addition module (or adder), a multiplication module (or multiplier) and a table lookup module (or table look-up unit).
  • the SIMD of the present disclosure can support different levels of pipelining. That is, the SIMD of the present disclosure can flexibly support combinations of different numbers of operators based on the arrangement of the operators on the operation pipeline.
  • assume that there is a pipeline (named "stage1"), similar to the first group of multi-stage pipeline operation circuits 304 and the second group of multi-stage pipeline operation circuits 306, on which six functional modules are arranged in order from top to bottom to form a six-stage pipeline, specifically: stage1-1-adder 1 (first-stage adder), stage1-2-adder 2 (second-stage adder), stage1-3-multiplier 1 (first-stage multiplier), stage1-4-multiplier 2 (second-stage multiplier), stage1-5-adder 1 (first-stage adder), stage1-6-adder 2 (second-stage adder).
  • the first-stage adder acts as the first stage of the pipeline and the second-stage adder acts as the second stage; the first-stage multiplier and the second-stage multiplier also perform a similar two-stage operation.
  • the two-stage adder or multiplier here is merely an example and not a limitation, and in some application scenarios, only one-stage adder or multiplier may be provided in a multi-stage pipeline.
  • each pipeline may include several same or different operators to achieve the same or different functions.
  • different pipelines may include different arithmetic units, so that each pipeline implements arithmetic operations of different functions.
  • the operators or circuits that realize the aforementioned different functions may include, but are not limited to, a random number processing circuit, an addition and subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit or a filter.
  • a pooler is taken as an example, which can be exemplarily constituted by operators such as an adder, a divider, a comparator, etc., so as to perform the pooling operation in the neural network.
  • the multi-stage pipeline operations in the main processing circuit can support unary operations (i.e., a situation where there is only one item of input data).
  • a set of three-stage pipeline operation circuits including a multiplier, an adder, and a nonlinear operator of the present disclosure can be applied to perform the operation.
  • the multiplier of the first-stage pipeline can be used to calculate the product of the input data ina and a to obtain the first-stage pipeline operation result.
  • the adder of the second-stage pipeline can be used to perform an addition operation on the first-stage pipeline operation result (a*ina) and b to obtain the second-stage pipeline operation result.
  • the relu activation function of the third-stage pipeline can be used to activate the second-stage pipeline operation result (a*ina+b) to obtain the final operation result.
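The three stages above compute relu(a*ina + b). A minimal software model of this unary pipeline, assuming ReLU as the nonlinear operator (as in the example), could look like:

```python
# Behavioural model of the three-stage unary pipeline:
# stage 1 multiplies, stage 2 adds, stage 3 applies ReLU.
def three_stage_pipeline(ina, a, b):
    stage1 = a * ina          # first-stage multiplier: a * ina
    stage2 = stage1 + b       # second-stage adder: a * ina + b
    return max(0.0, stage2)   # third-stage ReLU activation

result = three_stage_pipeline(ina=2.0, a=3.0, b=-1.0)  # relu(3*2 - 1) = 5.0
```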
  • the two input data ina and inb can be, for example, neuron data.
  • an addition operation may be performed on the first-stage pipeline operation result "product” by using the addition tree in the second-stage pipeline operation circuit to obtain the second-stage pipeline operation result sum.
  • use the nonlinear operator of the third-stage pipeline operation circuit to perform an activation operation on "sum” to obtain the final convolution operation result.
  • a bypass operation can be performed on one-stage or multi-stage pipeline operation circuits that will not be used in an operation; that is, one or more stages of the multi-stage pipeline operation circuit can be selectively used to perform the operation and obtain the final operation result according to the needs of the operation, without making the operation go through all the stages of the multi-stage pipeline, and the pipeline operation circuits that are not used can be bypassed before or during the pipeline operation.
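The bypass idea can be sketched as a chain of stage functions where unused stages are skipped. The stage set and names below are illustrative, not the patent's:

```python
# Sketch of stage bypassing: each pipeline stage is a function, and
# stages named in `bypass` pass data through unchanged.
def run_pipeline(x, stages, bypass=()):
    for name, fn in stages:
        if name in bypass:
            continue          # bypassed stage: data flows through untouched
        x = fn(x)
    return x

stages = [("mul2", lambda v: v * 2),
          ("add3", lambda v: v + 3),
          ("relu", lambda v: max(0, v))]
full = run_pipeline(5, stages)                    # relu(5*2 + 3) = 13
short = run_pipeline(5, stages, bypass={"add3"})  # relu(5*2) = 10
```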
  • each group of pipeline operation circuits can independently perform the pipeline operations.
  • each of the plurality of groups of pipelined circuits may also perform the pipelining operations cooperatively.
  • the output obtained after the first stage and the second stage in the first group of pipeline operation circuits perform serial pipeline operations can be used as the input of the third-stage pipeline operation of another group of pipeline operation circuits.
  • the first stage and the second stage in the first group of pipeline operation circuits perform parallel pipeline operations, and respectively output the results of their pipeline operations as the inputs of the first-stage and/or second-stage pipeline operations of another group of pipeline operation circuits.
  • 4a, 4b and 4c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure.
  • regarding the conversion operation performed by the data conversion circuit 3021 in the main processing circuit, the following takes the transpose operation and the horizontal mirror operation performed on an original matrix (which can be regarded as a 2-dimensional tensor in the present disclosure) as examples for further detailed description.
  • the original matrix is a matrix of (M+1) rows by (N+1) columns.
  • the data conversion circuit can perform a transposition operation on the original matrix shown in FIG. 4a to obtain the matrix shown in FIG. 4b.
  • the data conversion circuit can exchange the row numbers and column numbers of elements in the original matrix to form a transposed matrix.
  • for example, the element "10" whose coordinate in the original matrix shown in FIG. 4a is the 1st row and the 0th column has the coordinate of the 0th row and the 1st column in the transposed matrix shown in FIG. 4b; the element "M0" whose coordinate is the Mth row and the 0th column has the coordinate of the 0th row and the Mth column in the transposed matrix shown in FIG. 4b.
  • the data conversion circuit may perform a horizontal mirror operation on the original matrix shown in FIG. 4a to form a horizontal mirror matrix.
  • through the horizontal mirror operation, the data conversion circuit can convert the arrangement order from the first-row elements to the last-row elements in the original matrix into the arrangement order from the last-row elements to the first-row elements, while the column numbers of the elements in the original matrix remain the same.
  • for example, after the horizontal mirror operation, the element "00" in the 0th row and the 0th column of the original matrix shown in FIG. 4a is located in the Mth row and the 0th column of the horizontal mirror matrix, the element "10" in the 1st row and the 0th column is located in the (M-1)th row and the 0th column, and the element "M0" in the Mth row and the 0th column is located in the 0th row and the 0th column.
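The two conversions can be modelled on plain nested lists so that the element moves match the coordinate examples above (this is a software illustration of the behaviour, not the circuit implementation):

```python
# Transpose: the element at (row i, col j) moves to (row j, col i).
def transpose(mat):
    return [list(row) for row in zip(*mat)]

# Horizontal mirror: row order is reversed; column numbers are unchanged.
def horizontal_mirror(mat):
    return mat[::-1]

orig = [["00", "01"],
        ["10", "11"]]
assert transpose(orig)[0][1] == "10"               # "10" moves from (1,0) to (0,1)
assert horizontal_mirror(orig)[0] == ["10", "11"]  # last row becomes the first row
```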
  • FIG. 5 is a block diagram illustrating a slave processing circuit 500 of a computing device according to an embodiment of the present disclosure. It can be understood that the structures shown in the figures are only exemplary and not limiting, and those skilled in the art can also think of adding more arithmetic units to form more stages of pipeline arithmetic circuits based on the teachings of the present disclosure.
  • the slave processing circuit 500 includes a four-stage pipeline operation circuit composed of a multiplier 502 , a comparator 504 , a selector 506 , an accumulator 508 and a converter 510 .
  • the slave processing circuit as a whole may perform vector (including eg matrix) operations.
  • the slave processing circuit 500, according to the received microinstructions (the control signals shown in the figure), controls the input of vector data (which can be regarded as 1-dimensional tensors in the present disclosure), including weight data and neuron data, into the multiplier.
  • the multiplier inputs the result to the selector 506 .
  • selector 506 chooses to pass the result of the multiplier, rather than the result from the comparator, to accumulator 508, performing the accumulation operation in the vector operation.
  • the accumulator transmits the accumulated result (i.e., the accumulated sum "ACC_SUM" shown in the figure) to the converter 510 to perform the data conversion operation described above.
  • the four-stage pipeline operation circuit shown in FIG. 5 can also be used to perform operations such as histogram operations, and multiply-accumulate operations of depthwise layers, integration operations and Winograd multiply-accumulate operations in neural network operations.
  • when a histogram operation is performed, in the first stage of the operation, the input data is input into the comparator of the slave processing circuit according to the microinstruction. Accordingly, the selector 506 here chooses to pass the result of the comparator, rather than the result of the multiplier, to the accumulator for subsequent operations.
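The selector's role can be sketched with a small behavioural model: per the microinstruction, stage 1 is either the multiplier path or the comparator path, the accumulator sums the selected results, and the converter is modelled as identity. Modes, signatures and the comparison predicate are illustrative assumptions.

```python
# Behavioural model of the four-stage slave pipeline of FIG. 5:
# stage 1 = multiplier or comparator (chosen by the selector),
# stage 3 = accumulator, stage 4 = converter (identity here).
def slave_pipeline(pairs, mode="mul", convert=lambda v: v):
    acc = 0
    for a, b in pairs:
        if mode == "mul":
            stage1 = a * b               # multiplier path (e.g. vector dot product)
        else:
            stage1 = 1 if a > b else 0   # comparator path (e.g. histogram counting)
        acc += stage1                    # accumulator
    return convert(acc)                  # converter (data type conversion)

dot = slave_pipeline([(1, 2), (3, 4)])                # 1*2 + 3*4 = 14
count = slave_pipeline([(5, 1), (0, 2)], mode="cmp")  # one pair has a > b
```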
  • the slave processing circuit of the present disclosure may include a plurality of operation circuits for performing slave arithmetic operations, and the plurality of operation circuits are connected and configured to perform multi-stage pipeline operations.
  • the aforementioned arithmetic circuits may include, but are not limited to, one or more of multiplication circuits, comparison circuits, accumulation circuits and conversion circuits, so as to perform at least vector operations, such as multi-dimensional convolution operations in neural networks.
  • the slave processing circuit of the present disclosure may, according to slave instructions (implemented as, for example, one or more microinstructions or control signals), perform arithmetic operations on the data preprocessed by the master processing circuit to obtain the expected operation results.
  • the slave processing circuit may send the intermediate result obtained after its operation (e.g., via an interconnection interface) to the data processing unit in the master processing circuit, so that the data conversion circuit in the data processing unit performs data type conversion on the intermediate result, or the data splicing circuit in the data processing unit performs data splitting and splicing operations on the intermediate result, thereby obtaining the final operation result.
  • COSHLC = FPTOFIX + SHUFFLE + LT3DCONV
  • FPTOFIX represents the data type conversion operation performed by the data conversion circuit in the main processing circuit, that is, the input data is converted from a floating-point number to a fixed-point number
  • SHUFFLE represents the data splicing operation performed by the data splicing circuit
  • LT3DCONV represents the 3DCONV operation performed by the slave processing circuit (designated by "LT"), that is, a convolution operation on 3-dimensional data. It can be understood that when only the convolution operation on 3-dimensional data is performed, both FPTOFIX and SHUFFLE, which are part of the main operation, can be set as optional operations.
  • the operations performed by it can be expressed as:
  • the subtractor in the main processing circuit can perform the subtraction operation SUB on the 3D convolution result.
  • one binary operand (that is, the convolution result and the subtrahend)
  • one unary operand (that is, the final result obtained after executing the LCSU instruction)
  • the data splicing circuit performs the data splicing operation represented by SHUFFLE.
  • the LT3DCONV operation is performed on the spliced data by the slave processing circuit to obtain a 3D convolution result.
  • the addition operation ADD is performed on the 3D convolution result by the adder in the main processing circuit to obtain the final calculation result.
  • the operation instructions obtained by the present disclosure include one of the following combinations according to specific operation operations: preprocessing instructions and slave processing instructions; slave processing instructions and postprocessing instructions; and preprocessing instructions, slave processing instructions, and postprocessing instructions.
  • the preprocessing instruction may include a data conversion instruction and/or a data splicing instruction.
  • the post-processing instructions include one or more of the following: a random number processing instruction, an addition instruction, a subtraction instruction, a table lookup instruction, a parameter configuration instruction, a multiplication instruction, a pooling instruction, an activation instruction, a comparison instruction, an absolute value instruction, a logical operation instruction, a position index instruction or a filter instruction.
  • the slave processing instructions may include various types of operation instructions, including but not limited to instructions similar to the post-processing instructions and instructions for complex data processing, such as vector operation instructions or tensor operation instructions.
  • the present disclosure also discloses a method for performing computing operations using a computing device, wherein the computing device includes a master processing circuit and at least one slave processing circuit (i.e., the computing device discussed above in connection with FIGS. 1-5). The method includes configuring the master processing circuit to perform master arithmetic operations in response to a master instruction, and configuring the slave processing circuit to perform slave arithmetic operations in response to a slave instruction.
  • the aforementioned main operation includes a pre-processing operation and/or a post-processing operation for the slave operation, and the main instruction and the slave instruction are obtained by parsing according to the calculation instruction received by the computing device .
  • the operand of the calculation instruction includes a descriptor for indicating the shape of the tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand.
  • the method may further include configuring the master processing circuit and/or the slave processing circuit to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage address.
  • the efficiency of tensor operation and the speed of data access can be improved, and the overhead of tensor operation can be further reduced.
  • for the sake of brevity, the steps of the method are not described here again; those skilled in the art can understand from the content disclosed in the present disclosure that the method can perform the various operations described above in conjunction with FIGS. 1-5.
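The division of labor the method describes (master pre-processing, slave operation, master post-processing) can be sketched as follows; the three placeholder functions are illustrative stand-ins, not operations mandated by the disclosure:

```python
def master_preprocess(data):
    # stand-in for a preprocessing operation, e.g. data type conversion
    return [float(x) for x in data]

def slave_operation(data):
    # stand-in for a slave arithmetic operation, e.g. a vector reduction
    return sum(x * x for x in data)

def master_postprocess(result):
    # stand-in for a postprocessing operation, e.g. a ReLU-style activation
    return max(result, 0.0)

def run_compute(data):
    # master pre-processing -> slave operation -> master post-processing
    return master_postprocess(slave_operation(master_preprocess(data)))
```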
  • FIG. 6 is a structural diagram illustrating a combined processing apparatus 600 according to an embodiment of the present disclosure.
  • the combined processing device 600 includes a computing processing device 602 , an interface device 604 , other processing devices 606 and a storage device 608 .
  • one or more computing devices 610 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 5 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (CPUs), graphics processing units (GPUs), artificial intelligence processors, and other general-purpose and/or special-purpose processors.
  • these processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and other processing devices are considered together, the two can be regarded as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial-intelligence-related computing device, such as one for neural network operations) and external data and control, performing basic controls including but not limited to data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing processing device.
  • the computing processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the chip of the computing processing device.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (e.g., chip 702 shown in FIG. 7).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 6 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 706 shown in FIG. 7 ).
  • the relevant components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the chip may further integrate other processing units (such as video codecs) and interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 7 .
  • FIG. 7 is a schematic structural diagram illustrating a board 700 according to an embodiment of the present disclosure.
  • the board includes a storage device 704 for storing data, which includes one or more storage units 710 .
  • the storage device may be connected to the control device 708 and the chip 702 described above through, for example, a bus for data transmission.
  • the board also includes an external interface device 706, which is configured to relay or transfer data between the chip (or a chip in the chip package structure) and an external device 712 (such as a server or a computer).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • the control device may include a microcontroller unit (MCU) for regulating the working state of the chip.
  • an electronic device or apparatus is also disclosed, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound machines, and/or electrocardiographs.
  • the electronic device or apparatus of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal.
  • according to the solution of the present disclosure, an electronic device or apparatus with high computing power can be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of the present disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for realizing one or more solutions of the present disclosure. In addition, according to different solutions, the description of some embodiments in the present disclosure also has different emphases. In view of this, for the parts that are not described in detail in a certain embodiment of the present disclosure, those skilled in the art can refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or any other medium that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), ROM, RAM, etc.
  • Clause 1 A computing device comprising a master processing circuit and at least one slave processing circuit, wherein:
  • the main processing circuit is configured to perform main arithmetic operations in response to a main instruction
  • the slave processing circuit is configured to perform a slave arithmetic operation in response to a slave instruction
  • the master computing operation includes a pre-processing operation and/or a post-processing operation for the slave computing operation
  • the master instruction and the slave instruction are parsed and obtained according to the computing instruction received by the computing device
  • the operand of the calculation instruction includes a descriptor for indicating the shape of the tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand
  • the master processing circuit and/or the slave processing circuit are configured to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage addresses.
  • Clause 2 The computing device of clause 1, wherein the computing instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
  • Clause 3 The computing device of clause 2, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
  • Clause 4 The computing device of Clause 3, wherein the address parameter of the tensor data comprises a base address of a data base point of the descriptor in the data storage space of the tensor data.
  • Clause 5 The computing device of Clause 4, wherein the shape parameter of the tensor data comprises at least one of the following:
  • the size of the data storage space in at least one of N dimension directions, the size of the storage area of the tensor data in at least one of the N dimension directions, the offset of the storage area in at least one of the N dimension directions, the positions of at least two vertices at diagonal positions of the N dimension directions relative to the data reference point, and a mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
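To make the role of these shape parameters concrete, the following hypothetical sketch derives a storage address from a descriptor holding a base address and per-dimension sizes (the row-major layout and 4-byte element size are assumptions for illustration, not requirements of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    base_address: int      # base address of the data reference point
    shape: tuple           # size in each of the N dimension directions
    element_size: int = 4  # assumed bytes per element

    def address_of(self, index):
        """Storage address of the element at an N-dimensional index,
        assuming a row-major layout."""
        offset = 0
        for dim, i in zip(self.shape, index):
            assert 0 <= i < dim, "index outside the described storage area"
            offset = offset * dim + i
        return self.base_address + offset * self.element_size

d = TensorDescriptor(base_address=0x1000, shape=(4, 8))
addr = d.address_of((1, 2))  # 0x1000 + (1 * 8 + 2) * 4 = 0x1028
```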
  • the slave instruction is sent to the slave processing circuit.
  • Clause 7 The computing device according to clause 1, further comprising a control circuit, the control circuit configured to: obtain the calculation instruction and parse it to obtain the master instruction and the slave instruction; and
  • the master instruction is sent to the master processing circuit and the slave instruction is sent to the slave processing circuit.
  • Clause 8 The computing device of clause 1, wherein the host instruction includes an identification bit for identifying the preprocessing operation and/or the postprocessing operation.
  • Clause 9 The computing device of Clause 1, wherein the computing instruction includes a preset bit for distinguishing the preprocessing operation and the postprocessing operation in the main instruction.
  • Clause 10 The computing device of clause 1, wherein the main processing circuit comprises a data processing unit for performing the main arithmetic operation, and the data processing unit comprises a data conversion circuit for performing data conversion operations and/or a data splicing circuit for performing data splicing operations.
  • Clause 11 The computing device of clause 10, wherein the data conversion circuit includes one or more converters for enabling conversion of computational data between a plurality of different data types.
  • Clause 12 The computing device of clause 10, wherein the data splicing circuit is configured to split the computing data by a predetermined bit length, and to splice the plurality of data blocks obtained after the splitting in a predetermined order.
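The split-and-splice behavior described above can be sketched in software as follows (the predetermined bit length and order here, 8 bits and byte reversal, are arbitrary choices for illustration):

```python
def split_bits(value, total_bits, block_bits):
    """Split `value` (total_bits wide) into blocks of block_bits,
    most significant block first."""
    assert total_bits % block_bits == 0
    mask = (1 << block_bits) - 1
    n = total_bits // block_bits
    return [(value >> (block_bits * (n - 1 - k))) & mask for k in range(n)]

def splice_blocks(blocks, block_bits, order):
    """Re-splice the blocks into a single value in the given order."""
    out = 0
    for idx in order:
        out = (out << block_bits) | blocks[idx]
    return out

blocks = split_bits(0xAABBCCDD, 32, 8)            # [0xAA, 0xBB, 0xCC, 0xDD]
swapped = splice_blocks(blocks, 8, [3, 2, 1, 0])  # 0xDDCCBBAA
```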
  • the main processing circuit comprises one or more sets of pipelined arithmetic circuits, each set forming an arithmetic pipeline and comprising one or more operators, wherein when each set of pipelined arithmetic circuits includes a plurality of operators, the plurality of operators are connected and configured to selectively participate in performing the main arithmetic operation according to the main instruction.
  • Clause 14 The computing device of clause 13, wherein the main processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more operators or circuits of:
  • a random number processing circuit, an addition circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logical operator, a position index circuit, or a filter.
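A software analogy for operators that "selectively participate" in the pipeline: stages not enabled by the decoded main instruction are bypassed. The stage functions below are illustrative stand-ins, not the actual circuits:

```python
# Ordered pipeline stages; each stands in for one of the circuits listed above.
STAGES = {
    "multiply": lambda x: x * 2,      # stand-in for a multiplier
    "compare":  lambda x: max(x, 0),  # stand-in for a comparator
    "absolute": abs,                  # stand-in for an absolute value circuit
}

def run_pipeline(x, enabled):
    """Pass x through only the stages enabled by the (decoded) main
    instruction; all other stages are bypassed."""
    for name, stage in STAGES.items():
        if name in enabled:
            x = stage(x)
    return x

result = run_pipeline(-3, {"absolute", "multiply"})  # (-3 * 2) -> abs -> 6
```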
  • the slave processing circuit includes a plurality of arithmetic circuits for performing the slave arithmetic operations, and the plurality of arithmetic circuits are connected and configured to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a revolution number circuit to perform at least vector operations.
  • slave instruction comprises a convolution instruction that performs a convolution operation on the computed data subjected to the preprocessing operation, the slave processing circuit configured to:
  • a convolution operation is performed on the calculated data subjected to the preprocessing operation according to the convolution instruction.
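The flow described above, a convolution performed by the slave processing circuit on data that the master circuit has already preprocessed, can be sketched with a naive 1-D convolution (purely illustrative; the disclosure does not limit the convolution's dimensionality):

```python
def preprocess(data):
    # stand-in for the master circuit's preprocessing, e.g. type conversion
    return [float(x) for x in data]

def conv1d(data, kernel):
    """Naive valid-mode 1-D convolution (no kernel flip), as a stand-in
    for the slave circuit's convolution operation."""
    k = len(kernel)
    return [sum(data[i + j] * kernel[j] for j in range(k))
            for i in range(len(data) - k + 1)]

out = conv1d(preprocess([1, 2, 3, 4]), kernel=[1.0, -1.0])  # [-1.0, -1.0, -1.0]
```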
  • a method of performing computing operations using a computing device, wherein the computing device includes a master processing circuit and at least one slave processing circuit, the method comprising: configuring the master processing circuit to perform a master arithmetic operation in response to a master instruction, and configuring the slave processing circuit to perform a slave arithmetic operation in response to a slave instruction,
  • wherein the master operation includes a preprocessing operation and/or a postprocessing operation for the slave operation, and the master instruction and the slave instruction are obtained by parsing the calculation instruction received by the computing device,
  • wherein the operand of the calculation instruction includes a descriptor for indicating the shape of the tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand,
  • the method further includes configuring the master processing circuit and/or the slave processing circuit to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage address.
  • Clause 21 The method of clause 20, wherein the computation instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
  • Clause 22 The method of clause 21, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
  • the size of the data storage space in at least one of N dimension directions, the size of the storage area of the tensor data in at least one of the N dimension directions, the offset of the storage area in at least one of the N dimension directions, the positions of at least two vertices at diagonal positions of the N dimension directions relative to the data reference point, and a mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the slave instruction is sent to the slave processing circuit.
  • Clause 26 The method of clause 20, wherein the computing device includes a control circuit, the method further comprising configuring the control circuit to:
  • the master instruction is sent to the master processing circuit and the slave instruction is sent to the slave processing circuit.
  • Clause 27 The method of clause 20, wherein the host instruction includes an identification bit for identifying the preprocessing operation and/or the postprocessing operation.
  • Clause 28 The method of clause 20, wherein the computing instruction includes preset bits for distinguishing between the preprocessing operations and the postprocessing operations in the host instruction.
  • Clause 29 The method of clause 20, wherein the main processing circuit comprises a data processing unit, and the data processing unit comprises a data conversion circuit and/or a data splicing circuit, the method comprising configuring the data processing unit to perform the main arithmetic operation, configuring the data conversion circuit to perform data conversion operations, and configuring the data splicing circuit to perform data splicing operations.
  • Clause 30 The method of clause 29, wherein the data conversion circuit comprises one or more converters, the method comprising configuring the one or more converters to enable conversion of computing data between a plurality of different data types.
  • Clause 31 The method of clause 29, wherein the data splicing circuit is configured to split the computing data by a predetermined bit length, and to splice the plurality of data blocks obtained after the splitting in a predetermined order.
  • Clause 32 The method of clause 20, wherein the main processing circuit comprises one or more sets of pipelined arithmetic circuits, each set forming an arithmetic pipeline and comprising one or more operators, wherein when each set of pipelined arithmetic circuits includes a plurality of operators, the method includes connecting the plurality of operators and configuring them to selectively participate in performing the main arithmetic operation according to the master instruction.
  • Clause 33 The method of clause 32, wherein the main processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more operators or circuits of:
  • a random number processing circuit, an addition circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logical operator, a position index circuit, or a filter.
  • the slave processing circuit comprises a plurality of arithmetic circuits, the method comprising configuring the plurality of arithmetic circuits to perform the slave arithmetic operations, and the method further comprising connecting the plurality of arithmetic circuits and configuring them to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a revolution number circuit to perform at least vector operations.
  • slave instruction comprises a convolution instruction that performs a convolution operation on computational data subjected to the preprocessing operation, the method comprising configuring the slave processing circuit to:
  • a convolution operation is performed on the calculated data subjected to the preprocessing operation according to the convolution instruction.


Abstract

A computing apparatus, an integrated circuit chip, a board card, an electronic device, and a method of executing an arithmetic operation using the computing apparatus. The computing apparatus may be comprised in a combined processing apparatus, which may further comprise a universal interconnection interface and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further comprise a storage apparatus, which is connected to the computing apparatus and the other processing apparatuses, respectively, and is used for storing data of the computing apparatus and the other processing apparatuses.

Description

Computing device, integrated circuit chip, board card, electronic device and computing method
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 2020106194608, filed on June 30, 2020, and entitled "Computing device, integrated circuit chip, board card, electronic device and computing method", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board card, an electronic device, and a computing method.
Background
In a computing system, an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a key role in improving the performance of computing chips (e.g., processors) in the computing system. Using associated instruction sets, various current computing chips (especially chips in the field of artificial intelligence) can complete various general or specific control operations and data processing operations. However, current instruction sets still have many shortcomings. For example, existing instruction sets are limited by the hardware architecture and perform poorly in terms of flexibility. Further, many instructions can only complete a single operation, while the execution of multiple operations usually requires multiple instructions, which potentially increases on-chip I/O data throughput. In addition, current instructions still have room for improvement in execution speed, execution efficiency, and the power consumption they impose on the chip.
In addition, the arithmetic instructions of conventional CPUs are designed to perform basic single-data scalar arithmetic operations. Here, a single-data operation means that each operand of an instruction is scalar data. However, in tasks such as image processing and pattern recognition, the operands are often of multi-dimensional vector (i.e., tensor data) data types, and using only scalar operations cannot enable the hardware to complete computing tasks efficiently. Therefore, how to efficiently perform multi-dimensional tensor operations is also an urgent problem to be solved in the current computing field.
Summary of the Invention
In order to solve at least the above-mentioned problems in the prior art, the present disclosure provides a hardware architecture platform and a solution of related instructions. With the solution disclosed herein, the flexibility of instructions can be increased, the efficiency of instruction execution can be improved, and computing costs and overhead can be reduced. Further, on the basis of the aforementioned hardware architecture, the solution of the present disclosure supports efficient memory access and processing of tensor data, thereby accelerating tensor operations and reducing the computational overhead incurred by tensor operations when the calculation instructions include multi-dimensional vector operands.
In a first aspect, the present disclosure discloses a computing device, comprising a master processing circuit and at least one slave processing circuit, wherein: the master processing circuit is configured to perform a master arithmetic operation in response to a master instruction, and the slave processing circuit is configured to perform a slave arithmetic operation in response to a slave instruction, wherein the master arithmetic operation includes a preprocessing operation and/or a postprocessing operation for the slave arithmetic operation, the master instruction and the slave instruction are obtained by parsing the calculation instruction received by the computing device, the operand of the calculation instruction includes a descriptor for indicating the shape of a tensor, the descriptor is used to determine the storage address of the data corresponding to the operand, and the master processing circuit and/or the slave processing circuit are configured to perform their respective master arithmetic operations and/or slave processing operations according to the storage address.
In a second aspect, the present disclosure discloses an integrated circuit chip, which includes the computing device mentioned in the preceding aspect and described in several embodiments below.
在第三方面中,本披露公开了一种板卡,其包括前一方面中提及并且在稍后的多个实施例中描述的集成电路芯片。In a third aspect, the present disclosure discloses a board including the integrated circuit chip mentioned in the previous aspect and described in various embodiments later.
在第四方面中,本披露公开了一种电子设备,其包括前一方面中提及并且在稍后的多个实施例中描述的集成电路芯片。In a fourth aspect, the present disclosure discloses an electronic device comprising the integrated circuit chip mentioned in the previous aspect and described in the various embodiments that follow.
In a fifth aspect, the present disclosure provides a method of performing computing operations using the aforementioned computing apparatus, wherein the computing apparatus includes a master processing circuit and at least one slave processing circuit, and the method comprises: configuring the master processing circuit to perform a master operation in response to a master instruction, and configuring the slave processing circuit to perform a slave operation in response to a slave instruction, wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, and the master instruction and the slave instruction are obtained by parsing a compute instruction received by the computing apparatus, wherein an operand of the compute instruction includes a descriptor indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand. The method further comprises configuring the master processing circuit and/or the slave processing circuit to perform their respective master operation and/or slave operation according to the storage address.
With the computing apparatus, integrated circuit chip, board card, electronic device, and method disclosed herein, master and slave instructions associated with master and slave operations can be executed efficiently, thereby accelerating the execution of computing operations. Further, the combination of master and slave operations enables the computing apparatus of the present disclosure to support more types of computations and operations. In addition, based on the pipelined arrangement of the computing apparatus of the present disclosure, compute instructions can be flexibly configured to meet computational requirements.
Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, in which:
FIG. 1a is a schematic diagram illustrating a computing apparatus according to an embodiment of the present disclosure;
FIG. 1b is a schematic diagram illustrating a data storage space according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a computing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating a master processing circuit of a computing apparatus according to an embodiment of the present disclosure;
FIGS. 4a, 4b, and 4c are schematic diagrams illustrating matrix conversions performed by a data conversion circuit according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a slave processing circuit of a computing apparatus according to an embodiment of the present disclosure;
FIG. 6 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 7 is a schematic structural diagram illustrating a board card according to an embodiment of the present disclosure.
Detailed Description
The solution of the present disclosure uses a hardware architecture of a master processing circuit and at least one slave processing circuit to perform associated data operations, so that relatively complex computations can be completed with relatively flexible, simplified compute instructions. Specifically, the disclosed solution uses a master instruction and a slave instruction obtained by parsing a compute instruction, having the master processing circuit execute the master instruction to carry out the master operation and the slave processing circuit execute the slave instruction to carry out the slave operation, so as to implement various complex computations including, for example, vector operations. Here, the master operation may include a pre-processing operation and/or a post-processing operation for the slave operation. In one embodiment, the pre-processing operation may be, for example, a data conversion operation and/or a data concatenation operation. In another embodiment, the post-processing operation may be, for example, an arithmetic operation on the result output by the slave processing circuit. In some scenarios, when an operand of the compute instruction includes a descriptor indicating the shape of a tensor, the disclosed solution uses that descriptor to determine the storage address of the data corresponding to the operand. On this basis, the master processing circuit and/or the slave processing circuit may be configured to perform their respective master operation and/or slave operation according to the storage address, where the master operation and/or slave operation may involve various kinds of operations on tensor data. In addition, depending on the arithmetic circuits or operators included in the master processing circuit, the compute instructions of the present disclosure support flexible, customized configuration to suit different application scenarios.
The technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1a is a schematic diagram illustrating a computing apparatus 100 according to an embodiment of the present disclosure. As shown in FIG. 1a, the computing apparatus 100 may include a master processing circuit 102 and slave processing circuits, such as the slave processing circuits 104, 106, and 108 shown in the figure. Although three slave processing circuits are shown here, those skilled in the art will understand that the computing apparatus 100 of the present disclosure may include any suitable number of slave processing circuits, and that the slave processing circuits, as well as the slave processing circuits and the master processing circuit, may be connected to one another in different ways; the present disclosure places no limitation on this. In one or more embodiments, the multiple slave processing circuits of the present disclosure may execute various slave instructions (obtained, for example, by parsing a compute instruction) in parallel, so as to improve the processing efficiency of the computing apparatus.
In the context of the present disclosure, a compute instruction may be an instruction in the instruction set that forms the interface between software and hardware; it may be machine language, in binary or another form, that is received and processed by hardware such as a processor (also called a processing circuit). A compute instruction may include an opcode that directs the processor's operation and one or more operands. Depending on the application scenario, a compute instruction may include one or more opcodes, and when it includes a single opcode, that opcode may direct multiple operations of the processor. According to the disclosed solution, an operand may include a descriptor indicating the shape of a tensor, and the descriptor may be used to determine the storage address of the data corresponding to the operand.
In one embodiment, the master instruction and the slave instruction may be obtained by parsing a compute instruction received by the computing apparatus. In operation, the master processing circuit may be configured to perform a master operation in response to the master instruction, and the slave processing circuit may be configured to perform a slave operation in response to the slave instruction. According to the disclosed solution, the aforementioned master or slave instruction may be a microinstruction or control signal running inside the processor, and may include (or direct) one or more operations. When an operand of the compute instruction includes a descriptor as described above, the master processing circuit and/or the slave processing circuit may be configured to access the tensor according to the storage address obtained from the descriptor. Through this descriptor-based memory access mechanism, the disclosed solution can significantly increase the speed at which tensor data is read and stored during tensor operations, thereby accelerating computation and reducing computational overhead.
In one embodiment, the aforementioned master operation may include a pre-processing operation and/or a post-processing operation for the slave operation. Specifically, the master instruction executed by the master processing circuit may include, for example, a pre-processing operation that performs data conversion and/or data concatenation on the data that will participate in the computation. In some application scenarios, the master instruction may also include a pre-processing operation that merely reads data selectively, for example reading data stored in a dedicated or private buffer and sending it to the slave processing circuit, or generating random numbers for the slave processing circuit's computation. In other application scenarios, depending on the type and number of operators included in the master processing circuit, the master instruction may include one or more post-processing operations associated with the functions of those operators. For example, the master instruction may include various types of operations, such as addition, multiplication, table lookup, comparison, averaging, and filtering, performed on the intermediate or final results obtained after the slave processing circuit executes the slave instruction. In some application scenarios, the aforementioned intermediate or final result may be a tensor as described above, and its storage address may be obtained from a descriptor of the present disclosure.
To facilitate identification of the pre-processing and/or post-processing operations, in some application scenarios the master instruction may include an identification bit that marks the pre-processing and/or post-processing operation. When the master instruction is fetched, the master processing circuit can thus determine from the identification bit whether to perform a pre-processing or a post-processing operation on the operation data. Additionally or alternatively, the pre-processing and post-processing operations in the master instruction may be distinguished by preset positions (also called instruction field segments) of the compute instruction. For example, when the compute instruction is arranged with preset positions in the order (master instruction + slave instruction), it can be determined that the master instruction in this compute instruction involves a pre-processing operation for the slave operation. As another example, when the compute instruction is arranged with preset positions in the order (slave instruction + master instruction), it can be determined that the master instruction in this compute instruction involves a post-processing operation on the slave operation. For ease of understanding, suppose the compute instruction has a length of three segments of predetermined bit width (i.e., the aforementioned preset positions); then the instruction in the first segment of predetermined bit width may be designated as the master instruction for the pre-processing operation, the instruction in the middle, second segment as the slave instruction for the slave operation, and the instruction in the last, third segment as the master instruction for the post-processing operation.
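The three-segment layout described above can be sketched as follows. This is a minimal illustration only: the 16-bit segment width and the packing of segments from most-significant to least-significant bits are assumptions for the example, not values taken from the present disclosure.

```python
# Hypothetical sketch: a compute instruction laid out as three fixed-width
# segments, in the order (pre-processing master instruction | slave
# instruction | post-processing master instruction), split by bit position.

SEG_WIDTH = 16                     # assumed predetermined bit width of each segment
SEG_MASK = (1 << SEG_WIDTH) - 1

def split_compute_instruction(word: int):
    """Split a 3*SEG_WIDTH-bit instruction word into its three segments.

    Returns (pre_master, slave, post_master); the first segment is assumed
    to occupy the most significant bits.
    """
    pre_master = (word >> (2 * SEG_WIDTH)) & SEG_MASK
    slave = (word >> SEG_WIDTH) & SEG_MASK
    post_master = word & SEG_MASK
    return pre_master, slave, post_master

# Example: pack three segment values, then recover them by position alone.
word = (0x00A1 << 32) | (0x00B2 << 16) | 0x00C3
print(split_compute_instruction(word))  # -> (161, 178, 195)
```

Because the segments are identified purely by position, no extra identification bit is needed in this variant; the position of each field implies its role.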
A slave instruction executed by the slave processing circuit may include one or more operations associated with the functions of one or more arithmetic circuits in the slave processing circuit. The slave instruction may include operations that perform computation on data that has undergone the pre-processing operation of the master processing circuit. In some application scenarios, the slave instruction may include various operations such as arithmetic operations, logical operations, and data type conversion. For example, the slave instruction may perform various vector-related multiply-accumulate operations, including for example convolution operations, on the pre-processed data. In other application scenarios, when the compute instruction includes no master instruction for a pre-processing operation, the slave processing circuit may also perform the slave operation directly on the input data according to the slave instruction.
In one or more embodiments, the master processing circuit 102 may be configured to obtain and parse a compute instruction, thereby obtaining the aforementioned master and slave instructions, and to send the slave instruction to the slave processing circuit. Specifically, the master processing circuit may include one or more decoding circuits (or decoders) for parsing compute instructions. Through its internal decoding circuit, the master processing circuit can parse a received compute instruction into one or more master instructions and/or slave instructions and send the corresponding slave instruction to the slave processing circuit, so that the slave processing circuit performs the slave operation. Here, depending on the application scenario, the slave instruction can be sent to the slave processing circuit in various ways. For example, when the computing apparatus includes a storage circuit, the master processing circuit may send the slave instruction to the storage circuit, which forwards it to the slave processing circuit. As another example, when multiple slave processing circuits operate in parallel, the master processing circuit may broadcast the same slave instruction to all of them. Additionally or alternatively, in some hardware architecture scenarios, the computing apparatus may further include a separate circuit, unit, or module dedicated to parsing the compute instructions received by the computing apparatus, such as the architecture described later in connection with FIG. 2.
In one or more embodiments, the slave processing circuit of the present disclosure may include multiple arithmetic circuits for performing the slave operation, where the multiple arithmetic circuits may be connected and configured to perform multi-stage pipelined operations. Depending on the operation scenario, the arithmetic circuits may include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number conversion circuit for performing at least vector operations. In one embodiment, when the computing apparatus of the present disclosure is applied to computation in the field of artificial intelligence, the slave processing circuit can perform the multi-dimensional convolution operations of a neural network according to the slave instruction.
As mentioned above, the master operation and/or slave operation of the present disclosure may also include various types of operations on tensor data. To this end, the disclosed solution proposes using descriptors to obtain information about tensor shape in order to determine the storage addresses of tensor data, so that tensor data can be fetched and stored through those storage addresses.
In one possible implementation, a descriptor may indicate the shape of N-dimensional tensor data, where N is a positive integer (e.g., N = 1, 2, or 3) or zero. A tensor can take many forms of data organization and can have different dimensionality: a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a tensor of 2 or more dimensions. The shape of a tensor includes information such as its dimensionality and the size of each of its dimensions. For example, consider the tensor:
[a 2 x 4 tensor shown as an image in the original publication]
The shape of this tensor can be described by the descriptor as (2, 4); that is, two parameters indicate that the tensor is two-dimensional, the size of its first dimension (columns) being 2 and the size of its second dimension (rows) being 4. It should be noted that the present application does not limit the manner in which a descriptor indicates the shape of a tensor.
In one possible implementation, the value of N may be determined according to the dimensionality (order) of the tensor data, or set according to how the tensor data is to be used. For example, when N is 3, the tensor data is three-dimensional, and a descriptor may indicate the shape (e.g., offsets, sizes) of the three-dimensional tensor data along its three dimensions. It should be understood that those skilled in the art can set the value of N according to actual needs; the present disclosure places no limitation on this.
In one possible implementation, a descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier is used to distinguish descriptors; for example, it may be a serial number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension of the tensor data.
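The identifier-plus-content structure just described can be sketched as a small data structure. The field names below are illustrative assumptions for the sketch, not names defined in the present disclosure.

```python
# A minimal sketch of a tensor descriptor as described above: an identifier
# plus content holding shape parameters.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TensorDescriptor:
    ident: int              # descriptor identifier, e.g. a serial or register number
    shape: Tuple[int, ...]  # shape parameters: size of each of the N dimensions

    @property
    def ndim(self) -> int:
        """Dimensionality N of the tensor data the descriptor indicates."""
        return len(self.shape)

# The (2, 4) example from the text: a two-dimensional tensor whose first
# dimension has size 2 and whose second dimension has size 4.
d = TensorDescriptor(ident=0, shape=(2, 4))
print(d.ndim, d.shape)  # -> 2 (2, 4)
```

In a real implementation the content could also carry address parameters alongside the shape parameters, as discussed later in this section.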
In one possible implementation, the identifier and/or content of a descriptor may be stored in a descriptor storage space (internal memory), such as a register, on-chip SRAM, or another medium cache. The tensor data indicated by the descriptor may be stored in a data storage space (internal or external memory), such as an on-chip cache or off-chip memory. The present disclosure places no limitation on the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier and content of a descriptor and the tensor data indicated by the descriptor may be stored in the same region of internal memory. For example, a contiguous region of the on-chip cache, with addresses ADDR0-ADDR1023, may be used to store the descriptor-related content. Within it, addresses ADDR0-ADDR63 may serve as the descriptor storage space, storing the identifiers and content of descriptors, while addresses ADDR64-ADDR1023 serve as the data storage space, storing the tensor data indicated by the descriptors. Within the descriptor storage space, addresses ADDR0-ADDR31 may store the descriptor identifiers and addresses ADDR32-ADDR63 the descriptor content. It should be understood that ADDR is not limited to one bit or one byte; it is used here to denote an address, i.e., one address unit. Those skilled in the art can determine the descriptor storage space, the data storage space, and their specific addresses according to the actual situation; the present disclosure places no limitation on this.
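The ADDR0-ADDR1023 partition in this example can be expressed as a few boundary constants. The sketch below simply mirrors the split given in the text (address units remain abstract, as there); it is an illustration, not part of the disclosed apparatus.

```python
# Region boundaries of the example contiguous on-chip layout, in abstract
# address units, mirroring the ADDR0-ADDR1023 split described above.
ID_START, ID_END = 0, 31               # descriptor identifiers
CONTENT_START, CONTENT_END = 32, 63    # descriptor content
DATA_START, DATA_END = 64, 1023        # tensor data indicated by descriptors

def region_of(addr: int) -> str:
    """Classify an address unit into the region it belongs to."""
    if ID_START <= addr <= ID_END:
        return "descriptor identifier"
    if CONTENT_START <= addr <= CONTENT_END:
        return "descriptor content"
    if DATA_START <= addr <= DATA_END:
        return "tensor data"
    raise ValueError("address outside the ADDR0-ADDR1023 region")

print(region_of(10), "|", region_of(40), "|", region_of(500))
# -> descriptor identifier | descriptor content | tensor data
```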
In one possible implementation, the identifier and content of a descriptor and the tensor data indicated by the descriptor may be stored in different regions of internal memory. For example, a register may serve as the descriptor storage space, storing the identifier and content of the descriptor, while an on-chip cache serves as the data storage space, storing the tensor data indicated by the descriptor.
In one possible implementation, when a register is used to store the identifier and content of a descriptor, the register's number may be used as the descriptor's identifier. For example, when the register's number is 0, the identifier of the descriptor it stores is set to 0. When the descriptor in the register is valid, a region may be allocated in the cache space for storing the tensor data, according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identifier and content of a descriptor may be stored in internal memory, while the tensor data indicated by the descriptor is stored in external memory. For example, the identifier and content of the descriptor may be stored on-chip and the tensor data indicated by the descriptor stored off-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be set aside for tensor data, with the starting address of each piece of tensor data in that space corresponding one-to-one with a descriptor. In this case, a control circuit can determine, from the descriptor alone, the data address in the data storage space of the data corresponding to the operand.
In one possible implementation, when the data address of the data storage space corresponding to a descriptor is a variable address, the descriptor may also be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter representing the address of the tensor data. For example, if the tensor data is 3-dimensional and the descriptor points to its address, the content of the descriptor may include a single address parameter representing the address of the tensor data, such as its starting physical address, or multiple address parameters of the address of the tensor data, such as the starting address plus an address offset, or per-dimension address parameters of the tensor data. Those skilled in the art can set the address parameters according to actual needs; the present disclosure places no limitation on this.
In one possible implementation, the address parameters of the tensor data may include a base address, in the data storage space of the tensor data, of the data reference point of the descriptor. The base address differs according to the choice of data reference point. The present disclosure places no limitation on the selection of the data reference point.
In one possible implementation, the base address may include the starting address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the base address of the descriptor is the starting address of the data storage space. When the data reference point of the descriptor is a data block other than the first in the data storage space, the base address of the descriptor is the address of that data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, the offset of the storage region in at least one of the N dimension directions, the positions, relative to the data reference point, of at least two vertices at diagonally opposite corners in the N dimension directions, and a mapping relationship between the data description positions of the tensor data indicated by the descriptor and data addresses. A data description position is the mapped position of a point or region in the tensor data indicated by the descriptor; for example, when the tensor data is 3-dimensional, the descriptor may use three-dimensional spatial coordinates (x, y, z) to represent the shape of the tensor data, and a data description position of the tensor data may be the position, expressed in three-dimensional coordinates (x, y, z), of a point or region to which the tensor data is mapped in three-dimensional space.
It should be understood that those skilled in the art can choose the shape parameters representing the tensor data according to the actual situation; the present disclosure places no limitation on this. By using descriptors during data access, associations between pieces of data can be established, thereby reducing the complexity of data access and improving instruction processing efficiency.
In one possible implementation, the content of the descriptor of the tensor data may be determined from the base address, in the data storage space of the tensor data, of the data reference point of the descriptor, the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, and/or the offset of the storage region in at least one of the N dimension directions.
Figure 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in Figure 1b, the data storage space 21 stores two-dimensional data in row-major order, which can be addressed by (x, y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (the base address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21: its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In one possible implementation, when a descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the base address of the descriptor may be agreed to be the start address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined by combining the size ori_x of the data storage space 21 on the X axis and its size ori_y on the Y axis with the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be expressed using the following formula (1):
D: {ori_x, ori_y, offset_x, offset_y, size_x, size_y}    (1)
It should be understood that although in the above example the content of the descriptor represents a two-dimensional space, those skilled in the art may set the specific dimensionality represented by the content of the descriptor according to the actual situation, and the present disclosure places no limitation on this.
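For illustration only, the formula-(1)-style descriptor content for the data block 23 of Figure 1b can be modeled as a plain record. The field names follow the shape parameters above; the concrete numbers and the Python representation itself are illustrative assumptions, not the disclosed hardware encoding:

```python
from dataclasses import dataclass

@dataclass
class Descriptor2D:
    # Shape parameters of formula (1); the base address PA_start is
    # agreed separately (here, the start address of storage space 21).
    ori_x: int     # size of each row of the data storage space
    ori_y: int     # total number of rows of the data storage space
    offset_x: int  # X-direction offset of the storage area
    offset_y: int  # Y-direction offset of the storage area
    size_x: int    # X-direction size of the storage area
    size_y: int    # Y-direction size of the storage area

# A hypothetical data block 23 inside a 10-wide, 8-row storage space.
d = Descriptor2D(ori_x=10, ori_y=8, offset_x=2, offset_y=1, size_x=4, size_y=3)
```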
In one possible implementation, a base address of the data reference point of the descriptor in the data storage space may be agreed upon, and on the basis of this base address the content of the descriptor of the tensor data may be determined according to the positions, relative to the data reference point, of at least two vertices located at diagonal positions in the N dimension directions.
For example, the base address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For instance, one datum (for example, the datum at position (2, 2)) may be selected in the data storage space 21 as the data reference point, and the physical address of that datum in the data storage space may be used as the base address PA_base. The content of the descriptor of the data block 23 in Figure 1b can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two vertices at diagonal positions of the data block 23 relative to the data reference point are determined; for example, the diagonal vertices in the upper-left-to-lower-right direction are used, where the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the base address PA_base, the relative position (x_min, y_min) of the upper-left vertex, and the relative position (x_max, y_max) of the lower-right vertex.
In one possible implementation, the content of the descriptor (with base address PA_base) can be expressed using the following formula (2):
D: {x_min, y_min, x_max, y_max}    (2)
It should be understood that although the above example uses the two diagonal vertices at the upper-left and lower-right corners to determine the content of the descriptor, those skilled in the art may choose the specific vertices among the at least two diagonal vertices according to actual needs, and the present disclosure places no limitation on this.
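The offset/size parameterization of formula (1) and the diagonal-vertex parameterization of formula (2) describe the same storage area, so one can be derived from the other. The sketch below assumes the data reference point lies at the origin and that vertex coordinates are inclusive; both are illustrative assumptions:

```python
def to_diagonal_vertices(offset_x, offset_y, size_x, size_y):
    """Convert offset/size shape parameters into the (x_min, y_min) and
    (x_max, y_max) diagonal vertices of formula (2), relative to a data
    reference point assumed to sit at the origin (inclusive coordinates)."""
    x_min, y_min = offset_x, offset_y
    x_max = offset_x + size_x - 1
    y_max = offset_y + size_y - 1
    return (x_min, y_min), (x_max, y_max)
```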
In one possible implementation, the content of the descriptor of the tensor data may be determined according to the base address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description positions and the data addresses of the tensor data indicated by the descriptor. The mapping relationship between data description positions and data addresses can be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional spatial data, a function f(x, y, z) can be used to define the mapping relationship between data description positions and data addresses.
In one possible implementation, the content of the descriptor can be expressed using the following formula (3):
D: {f(x, y, z)}, with base address PA_base    (3)
In one possible implementation, the descriptor is further used to indicate the address of the N-dimensional tensor data, where the content of the descriptor further includes at least one address parameter representing the address of the tensor data. For example, the content of the descriptor may be:
D: {PA, ori_x, ori_y, offset_x, offset_y, size_x, size_y}
Here PA is the address parameter, which may be a logical address or a physical address. The descriptor parsing circuit may take PA as any one of a vertex, a midpoint, or a preset point of the vector shape, and combine it with the shape parameters in the X and Y directions to obtain the corresponding data address.
In one possible implementation, the address parameter of the tensor data includes the base address of the data reference point of the descriptor in the data storage space of the tensor data, and the base address includes the start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be:
D: {PA_start, ori_x, ori_y, offset_x, offset_y, size_x, size_y}
Here PA_start is the base address parameter, which has been described above and is not repeated.
It should be understood that those skilled in the art may set the mapping relationship between data description positions and data addresses according to the actual situation, and the present disclosure places no limitation on this.
In one possible implementation, an agreed base address may be set for a task, and the descriptors in all instructions under that task use this base address; the content of a descriptor may then include shape parameters based on this base address. The base address may be determined by setting environment parameters for the task. For the description and usage of the base address, reference may be made to the foregoing embodiments. Under this implementation, the content of a descriptor can be mapped to a data address more quickly.
In one possible implementation, the content of each descriptor may include its own base address, in which case the base addresses of different descriptors may differ. Compared with setting a common base address through environment parameters, this approach allows each descriptor to describe data more flexibly and to use a larger data address space.
In one possible implementation, the data address, in the data storage space, of the data corresponding to an operand of a processing instruction may be determined according to the content of the descriptor. The computation of the data address is completed automatically by hardware, and the computation method differs depending on how the content of the descriptor is represented. The present disclosure places no limitation on the specific method of computing the data address.
For example, if the content of the descriptor in the operand is expressed using formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, and its size is size_x*size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (4):
PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (4)
From the start data address PA1(x,y) determined by formula (4) above, combined with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage area, the storage region of the tensor data indicated by the descriptor in the data storage space can be determined.
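Formula (4) translates directly into code. The following sketch is a software model for checking the arithmetic only; in the disclosure this computation is performed automatically by hardware:

```python
def start_address(PA_start, ori_x, offset_x, offset_y):
    """Start data address PA1 of the tensor data per formula (4):
    PA1 = PA_start + (offset_y - 1) * ori_x + offset_x."""
    return PA_start + (offset_y - 1) * ori_x + offset_x

# Illustrative numbers: a 10-wide storage space starting at address 1000,
# with a storage area offset by (2, 1).
pa1 = start_address(PA_start=1000, ori_x=10, offset_x=2, offset_y=1)
```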
In one possible implementation, when the operand further includes a data description position for the descriptor, the data address, in the data storage space, of the data corresponding to the operand may be determined according to the content of the descriptor together with the data description position. In this way, part of the data (for example, one or more data elements) in the tensor data indicated by the descriptor can be processed.
For example, if the content of the descriptor in the operand is expressed using formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, its size is size_x*size_y, and the data description position for the descriptor included in the operand is (xq, yq), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (5):
PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)    (5)
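Formula (5) can be modeled the same way; with the data description position (xq, yq) set to (0, 0) it reduces to the start address given by formula (4). The numbers below are illustrative:

```python
def element_address(PA_start, ori_x, offset_x, offset_y, xq, yq):
    """Data address PA2 of the element at data description position
    (xq, yq) inside the descriptor's storage area, per formula (5):
    PA2 = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)."""
    return PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)

# At (xq, yq) = (0, 0) this equals the start address of formula (4).
pa2_origin = element_address(1000, 10, 2, 1, 0, 0)
```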
The computing apparatus of the present disclosure has been described above in conjunction with Figures 1a and 1b. By using the computing apparatus of the present disclosure together with master instructions and slave instructions, multiple operations can be completed with a single computing instruction. This reduces the per-instruction data transfers that arise when multiple operations must be completed by multiple instructions, solves the I/O bottleneck problem of the computing apparatus, and effectively improves computational efficiency while lowering computational overhead. In addition, the solution of the present disclosure can flexibly set the types and number of operations included in a computing instruction according to the types of operators configured in the master processing circuit and the functions of the arithmetic circuits configured in the slave processing circuits, through the cooperation of the master and slave processing circuits, so that the computing apparatus can perform many types of computing operations, thereby expanding and enriching its application scenarios and meeting different computing needs.
In addition, because the master processing circuit and the slave processing circuits can be configured to support multi-stage pipeline operations, the execution efficiency of the operators in the master and slave processing circuits is improved and the computation time is further shortened. From the above description, those skilled in the art will understand that the hardware architecture shown in Figure 1a is merely exemplary and not limiting. Under the disclosure and teaching of the present disclosure, those skilled in the art may also add new circuits or devices to this architecture to realize more functions or operations. For example, a storage circuit may be added to the architecture shown in Figure 1a to store various instructions and data (for example, tensor data). Further, the master processing circuit and the slave processing circuits may be arranged at different physical or logical locations and connected through various data interfaces or interconnection units, so that the master operations and slave operations described above, including the various operations on tensors described in conjunction with Figure 1b, are completed through their interaction.
Figure 2 is a block diagram illustrating a computing apparatus 200 according to an embodiment of the present disclosure. It can be understood that the computing apparatus 200 shown in Figure 2 is a specific implementation of the computing apparatus 100 shown in Figure 1a, so the details of the master processing circuit and the slave processing circuits of the computing apparatus 100 described in conjunction with Figure 1a also apply to the computing apparatus 200 shown in Figure 2.
As shown in Figure 2, the computing apparatus 200 according to the present disclosure includes a master processing circuit 202 and a plurality of slave processing circuits 204, 206, and 208. Since the operations of the master and slave processing circuits have been described in detail above in conjunction with Figure 1a, they will not be repeated here. In addition to including the same master and slave processing circuits as the computing apparatus 100 shown in Figure 1a, the computing apparatus 200 of Figure 2 further includes a control circuit 210 and a storage circuit 212. In one embodiment, the control circuit may be configured to obtain the computing instruction and parse it to obtain the master instruction and the slave instruction, and to send the master instruction to the master processing circuit 202 and the slave instruction to one or more of the plurality of slave processing circuits 204, 206, and 208. In one scenario, the control circuit may send the parsed slave instruction to the slave processing circuits through the master processing circuit, as shown in Figure 2. Alternatively, when there is a connection between the control circuit and the slave processing circuits, the control circuit may also send the parsed slave instruction directly to the slave processing circuits. Similarly, when there is a connection between the storage circuit and the slave processing circuits, the control circuit may also send the slave instruction to the slave processing circuits via the storage circuit. In some computing scenarios, when a computing instruction includes an operand involving a tensor operation, the control circuit may use the descriptor discussed above to determine the storage address of the data corresponding to the operand, for example the start address of the tensor, and may instruct the master processing circuit or the slave processing circuits to fetch the tensor data participating in the tensor operation from the corresponding storage address in the storage circuit 212 in order to perform the tensor operation.
In one or more embodiments, the storage circuit 212 may store various computation-related data or instructions, for example including the tensors described above. In one scenario, the storage circuit may store neuron or weight data related to neural network operations, or store the final operation results obtained after the master processing circuit performs post-processing operations. Additionally, the storage circuit may store intermediate results obtained after the master processing circuit performs pre-processing operations, or intermediate results obtained after the slave processing circuits perform arithmetic operations. In operations on tensors, the aforementioned intermediate results may also be tensor-type data, read and stored at storage addresses determined by descriptors. In some application scenarios, the storage circuit may serve as the on-chip memory of the computing apparatus 200 to perform data read and write operations with off-chip memory, for example through a direct memory access ("DMA") interface. In some scenarios, when the computing instruction is parsed by the control circuit, the storage circuit may store the operation instructions obtained after parsing by the control circuit, such as the master instruction and/or the slave instruction. In addition, although the storage circuit is shown as a single block in Figure 2, depending on the application scenario it may be implemented as a memory including a main memory and a main cache, where the main memory may be used to store related operation data such as neurons, weights, and various constant terms, and the main cache module may be used to temporarily store intermediate data, such as data after the pre-processing operation and data before the post-processing operation; depending on the settings, this intermediate data may be invisible to the operator.
In the interactive application between the main memory and the master processing circuit, the pipeline arithmetic circuits in the master processing circuit may also perform corresponding operations by means of masks stored in the main storage circuit. For example, in the course of performing a pipeline operation, an arithmetic circuit may read a mask from the main storage circuit and may use the mask to indicate whether the data on which the arithmetic circuit performs the operation is valid. The main storage circuit can not only serve internal storage applications but also exchange data with storage devices outside the computing apparatus of the present disclosure, for example through direct memory access ("DMA").
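As a rough software model of the mask mechanism, each mask bit can be taken to mark whether the corresponding element is valid for the operation. The text does not specify how invalid elements are treated; the sketch below simply passes them through unchanged, which is an illustrative assumption:

```python
def masked_op(data, mask, op):
    """Apply `op` only to elements whose mask bit is set; pass the rest
    through unchanged (an assumed policy for invalid elements)."""
    return [op(x) if m else x for x, m in zip(data, mask)]
```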
Figure 3 is a block diagram illustrating a master processing circuit 300 of a computing apparatus according to an embodiment of the present disclosure. It can be understood that the master processing circuit 300 shown in Figure 3 is the master processing circuit shown and described in conjunction with Figures 1a and 2, so the descriptions of the master processing circuit in Figures 1a and 2 also apply to the description below in conjunction with Figure 3.
As shown in Figure 3, the master processing circuit 300 may include a data processing unit 302, a first group of pipeline arithmetic circuits 304, a last group of pipeline arithmetic circuits 306, and one or more groups of pipeline arithmetic circuits located between the two (represented by the black circles). In one embodiment, the data processing unit 302 includes a data conversion circuit 3021 and a data splicing circuit 3022. As mentioned above, when the master operation includes a pre-processing operation for the slave operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 3021 or the data splicing circuit 3022 performs the corresponding conversion or splicing operation according to the corresponding master instruction. The conversion and splicing operations are illustrated by example below.
As far as the data conversion operation is concerned, when the bit width of the data input to the data conversion circuit is relatively high (for example, a data bit width of 1024 bits), the data conversion circuit may convert the input data into data of a lower bit width according to the operation requirements (for example, output data with a bit width of 512 bits). Depending on the application scenario, the data conversion circuit may support conversion among multiple data types, for example among FP16 (16-bit floating point), FP32 (32-bit floating point), FIX8 (8-bit fixed point), FIX4 (4-bit fixed point), FIX16 (16-bit fixed point), and other data types of different bit widths. When the data input to the data conversion circuit is a matrix, the data conversion operation may be a transformation of the arrangement positions of the matrix elements, which may include, for example, matrix transposition and mirroring (described later in conjunction with Figures 4a-4c), matrix rotation by a predetermined angle (for example, 90, 180, or 270 degrees), and conversion of matrix dimensions.
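As one hypothetical example of such a type conversion, FP32 values can be narrowed to FIX8 by scaling, rounding, and saturating. The disclosure does not specify the fixed-point format; the 3 fractional bits used below are an illustrative assumption:

```python
def fp32_to_fix8(values, frac_bits=3):
    """Narrow floating-point values to signed 8-bit fixed point
    (frac_bits fractional bits assumed), saturating on overflow."""
    lo, hi = -128, 127                          # signed 8-bit range
    out = []
    for v in values:
        q = int(round(v * (1 << frac_bits)))    # scale and round
        out.append(max(lo, min(hi, q)))         # saturate to 8 bits
    return out

def fix8_to_fp32(values, frac_bits=3):
    """Inverse conversion back to floating point."""
    return [v / (1 << frac_bits) for v in values]
```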
As far as the data splicing operation is concerned, the data splicing circuit may perform operations such as parity splicing on data blocks extracted from the data according to, for example, a bit length set in the instruction. For example, when the data is 32 bits wide, the data splicing circuit may divide the data into eight 4-bit data blocks, numbered 1 through 8, then splice the four data blocks 1, 3, 5, and 7 together and splice the four data blocks 2, 4, 6, and 8 together for the operation.
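The parity splicing just described can be sketched as follows. Treating block 1 as the least-significant 4-bit block is a numbering assumption made for illustration; the circuit's actual block order is not specified:

```python
def parity_splice(word, block_bits=4, word_bits=32):
    """Split `word` into word_bits // block_bits blocks (block 1 assumed
    to be the least-significant block) and regroup the odd-numbered
    blocks (1, 3, 5, 7) and even-numbered blocks (2, 4, 6, 8)."""
    n = word_bits // block_bits
    mask = (1 << block_bits) - 1
    blocks = [(word >> (i * block_bits)) & mask for i in range(n)]

    def pack(group):
        out = 0
        for j, b in enumerate(group):
            out |= b << (j * block_bits)
        return out

    return pack(blocks[0::2]), pack(blocks[1::2])
```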
In other application scenarios, the data splicing operation described above may also be performed on data M (which may, for example, be a vector) obtained after performing an operation. Suppose the data splicing circuit splits the lower 256 bits of an even-numbered row of the data M into units of 8 bits each, obtaining 32 even-row unit data (denoted M_2i_0 through M_2i_31). Similarly, the lower 256 bits of an odd-numbered row of the data M are also split into 8-bit units, obtaining 32 odd-row unit data (denoted M_(2i+1)_0 through M_(2i+1)_31). Further, the 32 odd-row unit data and the 32 even-row unit data are arranged alternately, from low-order bits to high-order bits, with the even-row unit first and the odd-row unit second. Specifically, even-row unit data 0 (M_2i_0) is placed at the low-order position, followed by odd-row unit data 0 (M_(2i+1)_0); next, even-row unit data 1 (M_2i_1) is placed, and so on. When the placement of odd-row unit data 31 (M_(2i+1)_31) is complete, the 64 unit data have been spliced together to form a new 512-bit-wide datum.
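The even/odd row interleaving above can be modeled with integer bit operations. The function is parameterized by unit width and input width so that narrower widths can also be checked; with the defaults it produces the 512-bit result of the example:

```python
def interleave_rows(even_row, odd_row, unit_bits=8, take_bits=256):
    """Split the low `take_bits` of each row into `unit_bits` units and
    interleave them low-order first, even-row unit before odd-row unit,
    yielding a value of width 2 * take_bits."""
    n = take_bits // unit_bits
    mask = (1 << unit_bits) - 1
    out = 0
    pos = 0
    for k in range(n):
        out |= ((even_row >> (k * unit_bits)) & mask) << pos
        pos += unit_bits
        out |= ((odd_row >> (k * unit_bits)) & mask) << pos
        pos += unit_bits
    return out
```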
Depending on the application scenario, the data conversion circuit and the data splicing circuit in the data processing unit can be used together to make data pre-processing more flexible. For example, according to the different operations included in the master instruction, the data processing unit may perform only data conversion without data splicing, only data splicing without data conversion, or both data conversion and data splicing. In some scenarios, when the master instruction does not include a pre-processing operation for the slave operation, the data processing unit may be configured to disable the data conversion circuit and the data splicing circuit.
As mentioned above, the master processing circuit according to the present disclosure may include one or more groups of multi-stage pipeline arithmetic circuits, such as the two groups of multi-stage pipeline arithmetic circuits 304 and 306 shown in Figure 3, where each group performs a multi-stage pipeline operation comprising a first stage through an Nth stage, and each stage may include one or more operators so as to perform the multi-stage pipeline operation according to the master instruction. In one embodiment, the master processing circuit of the present disclosure may be implemented as a Single Instruction Multiple Data (SIMD) module, and each group of multi-stage pipeline arithmetic circuits may form an arithmetic pipeline. According to the operation requirements, the arithmetic pipeline may be provided, stage by stage, with varying numbers of different or identical functional modules (that is, the operators of the present disclosure), such as addition modules (or adders), multiplication modules (or multipliers), table-lookup modules (or table-lookup units), and other types of functional modules.
In some application scenarios, when the ordering requirements of the pipeline are satisfied, different functional modules on one pipeline can be used in combination, with one pipeline stage completing the operation represented by one operation code ("op") in a microinstruction. The SIMD of the present disclosure can thus support different levels of pipelining; that is, based on the arrangement of operators on the arithmetic pipeline, the SIMD of the present disclosure can flexibly support combinations of different numbers of ops.
Suppose there is a pipeline (named "stage1") similar to the first group of multi-stage pipeline arithmetic circuits 304 and the second group of multi-stage pipeline arithmetic circuits 306, with six functional modules arranged from top to bottom to form a six-stage pipeline, specifically: stage1-1 adder 1 (first-stage adder), stage1-2 adder 2 (second-stage adder), stage1-3 multiplier 1 (first-stage multiplier), stage1-4 multiplier 2 (second-stage multiplier), stage1-5 adder 1 (first-stage adder), and stage1-6 adder 2 (second-stage adder). As can be seen, the first-stage adder (as the first stage of the pipeline) and the second-stage adder (as the second stage of the pipeline) are used together to complete a two-stage addition operation. Likewise, the first-stage multiplier and the second-stage multiplier perform a similar two-stage operation. Of course, the two-stage adders or multipliers here are merely exemplary and not limiting; in some application scenarios, only a one-stage adder or multiplier may be provided in a multi-stage pipeline.
In some embodiments, two or more pipelines as described above may also be provided, wherein each pipeline may include several identical or different operators to implement the same or different functions. Further, different pipelines may include different operators, so that each pipeline implements arithmetic operations of different functions. The operators or circuits implementing the aforementioned different functions may include, but are not limited to, a random number processing circuit, an addition/subtraction circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit, or a filter. Taking the pooler as an example, it may illustratively be composed of operators such as an adder, a divider, and a comparator, so as to perform the pooling operation in a neural network.
In some application scenarios, the multi-stage pipeline operation in the main processing circuit can support unary operations (i.e., the case of only one item of input data). Taking the operation at the scale layer + relu layer in a neural network as an example, it is assumed that the calculation instruction to be executed is expressed as result=relu(a*ina+b), where ina is the input data (which may be, for example, a vector, a matrix, or a tensor) and a and b are both operation constants. For this calculation instruction, a group of three-stage pipeline operation circuits of the present disclosure including a multiplier, an adder, and a nonlinear operator may be applied to perform the operation. Specifically, the multiplier of the first pipeline stage may be used to calculate the product of the input data ina and a to obtain the first-stage pipeline operation result. Next, the adder of the second pipeline stage may be used to perform an addition on the first-stage result (a*ina) and b to obtain the second-stage pipeline operation result. Finally, the relu activation function of the third pipeline stage may be used to activate the second-stage result (a*ina+b) to obtain the final operation result result.
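As a purely illustrative software analogy (not the claimed hardware, and with hypothetical function and variable names), the data flow of this three-stage pipeline may be sketched as follows:

```python
import numpy as np

def scale_relu_pipeline(ina, a, b):
    """Sketch of the three-stage pipeline computing result = relu(a*ina + b)."""
    stage1 = a * ina                # stage 1: multiplier computes a*ina
    stage2 = stage1 + b             # stage 2: adder adds the constant b
    return np.maximum(stage2, 0.0)  # stage 3: nonlinear operator applies relu
```

Each assignment corresponds to the result latched at one pipeline stage before being passed to the next.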
In some application scenarios, the multi-stage pipeline operation circuit in the main processing circuit can support binary operations (for example, the convolution calculation instruction result=conv(ina,inb)) or ternary operations (for example, the convolution calculation instruction result=conv(ina,inb,bias)), where the input data ina, inb, and bias may each be a one-dimensional tensor (i.e., a vector, which may be, for example, integer, fixed-point, or floating-point data), a two-dimensional tensor (i.e., a matrix), or a tensor of three or more dimensions. Taking the convolution calculation instruction result=conv(ina,inb) as an example, the multiple multipliers, at least one adder tree, and at least one nonlinear operator included in the three-stage pipeline operation circuit structure may be used to execute the convolution operation expressed by this calculation instruction, where the two input data ina and inb may be, for example, neuron data. Specifically, the first-stage pipeline multipliers of the three-stage pipeline operation circuit may first be used for calculation, so as to obtain the first-stage pipeline operation result product=ina*inb (which may be regarded as one microinstruction in the operation instruction, corresponding to the multiplication operation). Then, an addition may be performed on the first-stage result "product" by the adder tree in the second-stage pipeline operation circuit to obtain the second-stage pipeline operation result sum. Finally, the nonlinear operator of the third-stage pipeline operation circuit performs an activation operation on "sum" to obtain the final convolution operation result.
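The three stages just described (multiplier array, adder tree, nonlinear operator) may be sketched in software as follows; this is an assumed simplification in which "convolution" is reduced to one elementwise multiply-and-reduce step and the activation is assumed to be relu:

```python
import numpy as np

def conv_pipeline_stages(ina, inb):
    """Sketch: multiplier array -> adder tree -> nonlinear operator (relu assumed)."""
    product = ina * inb      # stage 1: multipliers produce elementwise products
    total = np.sum(product)  # stage 2: adder tree reduces the products to "sum"
    return max(total, 0.0)   # stage 3: nonlinear operator activates the sum
```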
In some application scenarios, a bypass operation may be performed on one or more stages of the pipeline operation circuit that will not be used in an operation; that is, one or more stages of the multi-stage pipeline operation circuit may be selectively used according to the needs of the operation, without requiring the operation to pass through all stages of the pipeline. Taking the operation of calculating a Euclidean distance as an example, and assuming that its calculation instruction is expressed as dis=sum((ina-inb)^2), only several pipeline stages composed of an adder, a multiplier, an adder tree, and an accumulator need to be used to perform the operation and obtain the final result, while the unused pipeline operation circuits may be bypassed before or during the pipeline operation.
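As a hedged illustration of this selective use of stages, the dis=sum((ina-inb)^2) example may be sketched as below; only the subtraction, squaring, and reduction steps are exercised, standing in for the adder, multiplier, and adder-tree/accumulator stages, while all other stages are (conceptually) bypassed:

```python
import numpy as np

def euclidean_distance_pipeline(ina, inb):
    """Sketch of dis = sum((ina - inb)^2); unused pipeline stages are bypassed."""
    diff = ina - inb        # adder stage: ina - inb
    squared = diff * diff   # multiplier stage: (ina - inb)^2
    return np.sum(squared)  # adder tree / accumulator stage: sum(...)
```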
In the aforementioned pipeline operations, each group of pipeline operation circuits may perform the pipeline operations independently. However, each group among the multiple groups of pipeline operation circuits may also perform the pipeline operations cooperatively. For example, the output of the first and second stages of a first group of pipeline operation circuits after performing serial pipeline operations may serve as the input of the third pipeline stage of another group of pipeline operation circuits. As another example, the first and second stages of the first group of pipeline operation circuits may perform parallel pipeline operations and respectively output the results of their operations as the inputs of the first-stage and/or second-stage pipeline operations of another group of pipeline operation circuits.
Figs. 4a, 4b and 4c are schematic diagrams illustrating matrix conversions performed by a data conversion circuit according to an embodiment of the present disclosure. In order to better understand the conversion operations performed by the data conversion circuit 3021 in the main processing circuit, the transpose operation and the horizontal mirror operation performed on an original matrix (which may be regarded as a 2-dimensional tensor under the present disclosure) are further described below as examples.
As shown in Fig. 4a, the original matrix is a matrix of (M+1) rows × (N+1) columns. According to the requirements of the application scenario, the data conversion circuit may perform a transpose operation on the original matrix shown in Fig. 4a to obtain the matrix shown in Fig. 4b. Specifically, the data conversion circuit may exchange the row indices and column indices of the elements in the original matrix to form the transposed matrix. For example, the element "10" located at row 1, column 0 in the original matrix shown in Fig. 4a is located at row 0, column 1 in the transposed matrix shown in Fig. 4b. By analogy, the element "M0" located at row M, column 0 in the original matrix shown in Fig. 4a is located at row 0, column M in the transposed matrix shown in Fig. 4b.
As shown in Fig. 4c, the data conversion circuit may perform a horizontal mirror operation on the original matrix shown in Fig. 4a to form a horizontal mirror matrix. Specifically, through the horizontal mirror operation, the data conversion circuit may convert the arrangement of the original matrix from first-row-to-last-row order into last-row-to-first-row order, while the column indices of the elements in the original matrix remain unchanged. For example, the element "00" at row 0, column 0 and the element "10" at row 1, column 0 of the original matrix shown in Fig. 4a are located at row M, column 0 and row M-1, column 0, respectively, in the horizontal mirror matrix shown in Fig. 4c. By analogy, the element "M0" at row M, column 0 of the original matrix shown in Fig. 4a is located at row 0, column 0 in the horizontal mirror matrix shown in Fig. 4c.
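As a software analogy (not the claimed circuit), the two conversions correspond to NumPy's transpose and flipud operations; the small matrix below is a hypothetical stand-in for the original matrix of Fig. 4a, with each element named by its row and column index:

```python
import numpy as np

original = np.array([[ 0,  1,  2],
                     [10, 11, 12],
                     [20, 21, 22]])   # element "10" sits at row 1, column 0

transposed = original.T               # transpose: row and column indices exchanged
mirrored = np.flipud(original)        # horizontal mirror: row order reversed

assert transposed[0, 1] == original[1, 0]   # "10" moves to row 0, column 1
assert (mirrored[0] == original[2]).all()   # last row becomes the first row
```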
Fig. 5 is a block diagram illustrating a slave processing circuit 500 of a computing device according to an embodiment of the present disclosure. It can be understood that the structure shown in the figure is merely exemplary and not limiting, and based on the teachings of the present disclosure, those skilled in the art may also contemplate adding more operators to form pipeline operation circuits with more stages.
As shown in Fig. 5, the slave processing circuit 500 includes a four-stage pipeline operation circuit composed of a multiplier 502, a comparator 504, a selector 506, an accumulator 508, and a converter 510. In one application scenario, the slave processing circuit as a whole may perform vector (including, for example, matrix) operations.
When performing a vector operation, the slave processing circuit 500, according to the received microinstructions (the control signals shown in the figure), controls the input of vector data including weight data and neuron data (which may be regarded as 1-dimensional tensors under the present disclosure) into the multiplier. After the multiplication is performed, the multiplier feeds its result to the selector 506. Here, the selector 506 selects to pass the result of the multiplier, rather than the result from the comparator, to the accumulator 508, which performs the accumulation in the vector operation. Next, the accumulator passes the accumulated result to the converter 510 to perform the data conversion operation described above. Finally, the converter outputs the accumulated sum (i.e., "ACC_SUM" shown in the figure) as the final result.
In addition to performing the above matrix multiply-accumulate ("MAC") operation between neuron data and weight data, the four-stage pipeline operation circuit shown in Fig. 5 may also be used to perform operations in neural network computation such as histogram operations, depthwise-layer multiply-add operations, integration, and winograd multiply-add operations. When a histogram operation is performed, in the first-stage operation, the slave processing circuit inputs the input data into the comparator according to the microinstruction. Accordingly, in this case the selector 506 selects to pass the result of the comparator, rather than the result of the multiplier, to the accumulator for the subsequent operations.
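The role of the selector in routing either the multiplier path (MAC) or the comparator path (histogram-style counting) into the accumulator may be sketched as follows; the function name, the "mode" flag, and the comparator semantics (count of elements above a threshold) are hypothetical simplifications:

```python
import numpy as np

def slave_pipeline(data, weights=None, mode="mac", threshold=0.0):
    """Sketch of multiplier/comparator -> selector -> accumulator -> converter."""
    if mode == "mac":
        stage1 = data * weights                       # multiplier path
    else:
        stage1 = (data > threshold).astype(np.int64)  # comparator path
    acc = np.sum(stage1)   # selector has routed the active path to the accumulator
    return float(acc)      # converter: type conversion of the accumulated sum
```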
From the above description, those skilled in the art can understand that, in terms of hardware arrangement, the slave processing circuit of the present disclosure may include multiple operation circuits for performing slave operations, and the multiple operation circuits are connected and configured to perform multi-stage pipelined operations. In one or more embodiments, the aforementioned operation circuits may include, but are not limited to, one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a conversion circuit, so as to perform at least vector operations, for example multi-dimensional convolution operations in neural networks.
In one operation scenario, the slave processing circuit of the present disclosure may, according to slave instructions (implemented, for example, as one or more microinstructions or control signals), operate on data that has undergone pre-processing by the main processing circuit, so as to obtain the expected operation result. In another operation scenario, the slave processing circuit may send the intermediate result obtained from its operation (for example, via an interconnect interface) to the data processing unit in the main processing circuit, so that the data conversion circuit in the data processing unit performs a data type conversion on the intermediate result, or so that the data splicing circuit in the data processing unit performs data splitting and splicing operations on the intermediate result, thereby obtaining the final operation result. The operations of the main processing circuit and the slave processing circuits of the present disclosure are described below in conjunction with several exemplary instructions.
Taking the calculation instruction "COSHLC", which includes a pre-processing operation, as an example, the operations it performs (including the pre-processing operation performed by the main processing circuit and the slave operation performed by the slave processing circuit) can be expressed as:
COSHLC=FPTOFIX+SHUFFLE+LT3DCONV,
where FPTOFIX denotes the data type conversion operation performed by the data conversion circuit in the main processing circuit, i.e., converting input data from floating-point numbers to fixed-point numbers; SHUFFLE denotes the data splicing operation performed by the data splicing circuit; and LT3DCONV denotes the 3DCONV operation performed by the slave processing circuit (denoted "LT"), i.e., a convolution operation on 3-dimensional data. It can be understood that when only the convolution operation on 3-dimensional data is to be performed, both FPTOFIX and SHUFFLE, which are part of the main operation, may be set as optional operations.
Taking the calculation instruction LCSU, which includes a post-processing operation, as an example, the operations it performs (including the slave operation performed by the slave processing circuit and the post-processing operation performed by the main processing circuit) can be expressed as:
LCSU=LT3DCONV+SUB,
where, after the slave processing circuit performs the LT3DCONV operation to obtain a 3D convolution result, the subtractor in the main processing circuit may perform the subtraction operation SUB on the 3D convolution result. Thus, in each instruction execution cycle, one binary operand (i.e., the convolution result and the subtrahend) may be input, and one unary operand (i.e., the final result obtained after executing the LCSU instruction) may be output.
Further taking the calculation instruction SHLCAD, which includes a pre-processing operation, a slave operation, and a post-processing operation, as an example, the operations it performs (including the pre-processing operation performed by the main processing circuit, the slave operation performed by the slave processing circuit, and the post-processing operation performed by the main processing circuit) can be expressed as:
SHLCAD=SHUFFLE+LT3DCONV+ADD
where, in the pre-processing operation, the data splicing circuit performs the data splicing operation denoted by SHUFFLE. Next, the slave processing circuit performs the LT3DCONV operation on the spliced data to obtain a 3D convolution result. Finally, the adder in the main processing circuit performs the addition operation ADD on the 3D convolution result to obtain the final calculation result.
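The fusion of the three operations into one instruction may be sketched as simple function composition. The stand-ins below are hypothetical: the real SHUFFLE and LT3DCONV semantics are defined by the hardware, and here splicing is modeled as concatenation and the convolution as a dot product, purely to show the pre-processing / slave-operation / post-processing chain:

```python
import numpy as np

def shuffle(blocks):                  # pre-processing stand-in: data splicing
    return np.concatenate([np.ravel(b) for b in blocks])

def lt3dconv(data, kernel):           # slave-operation stand-in for 3D convolution
    return float(np.dot(data, kernel))

def shlcad(blocks, kernel, addend):
    """SHLCAD = SHUFFLE + LT3DCONV + ADD."""
    spliced = shuffle(blocks)         # main processing circuit: splice
    conv = lt3dconv(spliced, kernel)  # slave processing circuit: convolve
    return conv + addend              # main processing circuit: post-processing ADD
```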
From the above examples, those skilled in the art can understand that, after a calculation instruction is parsed, the operation instructions obtained under the present disclosure include, depending on the specific operations, one of the following combinations: a pre-processing instruction and a slave processing instruction; a slave processing instruction and a post-processing instruction; or a pre-processing instruction, a slave processing instruction, and a post-processing instruction. On this basis, in some embodiments, the pre-processing instruction may include a data conversion instruction and/or a data splicing instruction. In other embodiments, the post-processing instruction includes one or more of the following: a random number processing instruction, an addition instruction, a subtraction instruction, a table lookup instruction, a parameter configuration instruction, a multiplication instruction, a pooling instruction, an activation instruction, a comparison instruction, an absolute value instruction, a logical operation instruction, a position index instruction, or a filter instruction. In still other embodiments, the slave processing instruction may include various types of operation instructions, including but not limited to instructions similar to those among the post-processing instructions, as well as instructions for complex data processing, such as vector operation instructions or tensor operation instructions.
Based on the above description in conjunction with Fig. 1 (including Figs. 1a and 1b) through Fig. 5, those skilled in the art can understand that the present disclosure likewise discloses a method of performing computing operations using a computing device, wherein the computing device includes a main processing circuit and at least one slave processing circuit (i.e., the computing device discussed above in conjunction with Figs. 1-5). The method includes configuring the main processing circuit to perform a main operation in response to a main instruction, and configuring the slave processing circuit to perform a slave operation in response to a slave instruction. In one embodiment, the aforementioned main operation includes a pre-processing operation and/or a post-processing operation for the slave operation, and the main instruction and the slave instruction are obtained by parsing a calculation instruction received by the computing device. In another embodiment, an operand of the calculation instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand.
Based on the above descriptor arrangement, the method may further include configuring the main processing circuit and/or the slave processing circuit to perform their respective main operations and/or slave operations according to the storage address. As mentioned above, through the descriptor arrangement of the present disclosure, the efficiency of tensor operations and the rate of data access can be improved, further reducing the overhead of tensor operations. In addition, although further steps of the method are not described here for the sake of brevity, those skilled in the art can understand from the content of the present disclosure that the method of the present disclosure can perform the various operations described above in conjunction with Figs. 1-5.
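One conventional way a shape descriptor can determine a storage address, shown here only as an illustrative assumption (the disclosure does not fix the addressing scheme), is to derive row-major strides from the shape and combine them with the element indices:

```python
def row_major_strides(shape, elem_size):
    """Derive row-major byte strides from a shape descriptor (illustrative)."""
    strides, acc = [], elem_size
    for dim in reversed(shape):
        strides.append(acc)   # innermost dimension advances by one element
        acc *= dim
    return list(reversed(strides))

def element_address(base, shape, elem_size, indices):
    """Map tensor indices to a storage address using the shape descriptor."""
    strides = row_major_strides(shape, elem_size)
    return base + sum(s * i for s, i in zip(strides, indices))
```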
Fig. 6 is a structural diagram illustrating a combined processing apparatus 600 according to an embodiment of the present disclosure. As shown in Fig. 6, the combined processing apparatus 600 includes a computing processing apparatus 602, an interface apparatus 604, another processing apparatus 606, and a storage apparatus 608. Depending on the application scenario, the computing processing apparatus may include one or more computing devices 610, which may be configured to perform the operations described herein in conjunction with Figs. 1-5.
In different embodiments, the computing processing apparatus of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, the one or more computing devices included in the computing processing apparatus may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or parts of the hardware structure of artificial intelligence processor cores, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete a user-specified operation. Depending on the implementation, the other processing apparatuses of the present disclosure may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and an artificial intelligence processor. These processors may include, but are not limited to, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing apparatus and the other processing apparatuses are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing apparatus may serve as an interface between the computing processing apparatus of the present disclosure (which may be embodied as a computing apparatus related to artificial intelligence operations, such as neural network operations) and external data and control, performing basic control including, but not limited to, data transfer and starting and/or stopping the computing apparatus. In further embodiments, the other processing apparatus may also cooperate with the computing processing apparatus to jointly complete computing tasks.
In one or more embodiments, the interface apparatus may be used to transfer data and control instructions between the computing processing apparatus and the other processing apparatus. For example, the computing processing apparatus may obtain input data from the other processing apparatus via the interface apparatus and write it into an on-chip storage apparatus (or memory) of the computing processing apparatus. Further, the computing processing apparatus may obtain control instructions from the other processing apparatus via the interface apparatus and write them into an on-chip control cache of the computing processing apparatus. Alternatively or optionally, the interface apparatus may also read data from the storage apparatus of the computing processing apparatus and transfer it to the other processing apparatus.
Additionally or optionally, the combined processing apparatus of the present disclosure may further include a storage apparatus. As shown in the figure, the storage apparatus is connected to the computing processing apparatus and the other processing apparatus, respectively. In one or more embodiments, the storage apparatus may be used to store data of the computing processing apparatus and/or the other processing apparatus, for example data that cannot be fully held in the internal or on-chip storage of the computing processing apparatus or the other processing apparatus.
In some embodiments, the present disclosure also discloses a chip (for example, the chip 702 shown in Fig. 7). In one implementation, the chip is a system-on-chip (SoC) and integrates one or more combined processing apparatuses as shown in Fig. 6. The chip may be connected to other related components through an external interface apparatus (such as the external interface apparatus 706 shown in Fig. 7). Such related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. In some application scenarios, other processing units (such as a video codec) and/or interface modules (such as a DRAM interface) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below with reference to Fig. 7.
Fig. 7 is a schematic structural diagram illustrating a board card 700 according to an embodiment of the present disclosure. As shown in Fig. 7, the board card includes a storage device 704 for storing data, which includes one or more storage units 710. The storage device may be connected to, and exchange data with, a control device 708 and the chip 702 described above by means of, for example, a bus. Further, the board card also includes an external interface apparatus 706 configured for a data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 712 (for example, a server or a computer). For example, data to be processed may be transferred to the chip by the external device through the external interface apparatus. As another example, the calculation results of the chip may be transferred back to the external device via the external interface apparatus. Depending on the application scenario, the external interface apparatus may have different interface forms; for example, it may adopt a standard PCIE interface.
In one or more embodiments, the control device in the board card of the present disclosure may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.
根据上述结合图6和图7的描述，本领域技术人员可以理解本披露也公开了一种电子设备或装置，其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。Based on the above description in conjunction with FIG. 6 and FIG. 7, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing devices.
根据不同的应用场景，本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆；所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机；所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an IoT terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound machines, and/or electrocardiographs. The electronic device or apparatus of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care.
Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, an electronic device or apparatus with high computing power according to the solutions of the present disclosure can be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption can be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
需要说明的是，为了简明的目的，本披露将一些方法及其实施例表述为一系列的动作及其组合，但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此，依据本披露的公开或教导，本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步，本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例，即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外，根据方案的不同，本披露对一些实施例的描述也各有侧重。鉴于此，本领域技术人员可以理解本披露某个实施例中没有详述的部分，也可以参见其他实施例的相关描述。It should be noted that, for the sake of brevity, the present disclosure describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teachings herein, those skilled in the art will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that for parts not described in detail in one embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。In terms of specific implementation, based on the disclosure and teaching of this disclosure, those skilled in the art can understand that several embodiments disclosed in this disclosure can also be implemented in other ways not disclosed herein. For example, as for each unit in the foregoing electronic device or apparatus embodiment, this article divides them on the basis of considering logical functions, and there may also be other division methods in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As far as the connection relationship between different units or components is concerned, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。In this disclosure, units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or elements may be co-located or distributed over multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
在一些实现场景中，上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时，所述集成的单元可以存储在计算机可读取存储器中。基于此，当本披露的方案以软件产品（例如计算机可读存储介质）的形式体现时，该软件产品可以存储在存储器中，其可以包括若干指令用以使得计算机设备（例如个人计算机、服务器或者网络设备等）执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory, ROM)、随机存取存储器(Random Access Memory, RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。In some implementation scenarios, the above-mentioned integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solutions of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, an optical disc, and other media capable of storing program code.
在另外一些实现场景中，上述集成的单元也可以采用硬件的形式实现，即为具体的硬件电路，其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件，而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此，本文所述的各类装置（例如计算装置或其他处理装置）可以通过适当的硬件处理器来实现，例如CPU、GPU、FPGA、DSP和ASIC等。进一步，前述的所述存储单元或存储装置可以是任意适当的存储介质（包括磁存储介质或磁光存储介质等），其例如可以是可变电阻式存储器(Resistive Random Access Memory, RRAM)、动态随机存取存储器(Dynamic Random Access Memory, DRAM)、静态随机存取存储器(Static Random Access Memory, SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory, EDRAM)、高带宽存储器(High Bandwidth Memory, HBM)、混合存储器立方体(Hybrid Memory Cube, HMC)、ROM和RAM等。In other implementation scenarios, the above-mentioned integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits, and the like. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which may be, for example, a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), a ROM, or a RAM.
依据以下条款可更好地理解前述内容：The foregoing may be better understood in light of the following clauses:
条款1、一种计算装置,包括主处理电路和至少一个从处理电路,其中:Clause 1. A computing device comprising a master processing circuit and at least one slave processing circuit, wherein:
所述主处理电路配置成响应于主指令来执行主运算操作,the main processing circuit is configured to perform main arithmetic operations in response to a main instruction,
所述从处理电路配置成响应于从指令来执行从运算操作,the slave processing circuit is configured to perform a slave arithmetic operation in response to a slave instruction,
其中，所述主运算操作包括针对于所述从运算操作的前处理操作和/或后处理操作，所述主指令和所述从指令根据所述计算装置接收的计算指令解析得到，其中所述计算指令的操作数包括用于指示张量的形状的描述符，所述描述符用于确定所述操作数对应数据的存储地址，wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, the master instruction and the slave instruction are obtained by parsing a computing instruction received by the computing device, an operand of the computing instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine a storage address of the data corresponding to the operand,
其中所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。The master processing circuit and/or the slave processing circuit are configured to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage addresses.
条款2、根据条款1所述的计算装置,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。Clause 2. The computing device of clause 1, wherein the computing instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
条款3、根据条款2所述的计算装置,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。Clause 3. The computing device of clause 2, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
条款4、根据条款3所述的计算装置,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。Clause 4. The computing device of Clause 3, wherein the address parameter of the tensor data comprises a base address of a data base point of the descriptor in the data storage space of the tensor data.
条款5、根据条款4所述的计算装置,其中所述张量数据的形状参数包括以下至少一种:Clause 5. The computing device of Clause 4, wherein the shape parameter of the tensor data comprises at least one of the following:
所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系，其中N为大于或等于零的整数。a size of the data storage space in at least one of N dimension directions, a size of the storage region of the tensor data in at least one of the N dimension directions, an offset of the storage region in at least one of the N dimension directions, positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions, and a mapping relationship between data description positions and data addresses of the tensor data indicated by the descriptor, where N is an integer greater than or equal to zero.
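As an illustration of how the shape and address parameters of Clauses 4 and 5 can jointly determine a storage address, the following minimal Python sketch computes a flat address from a base address, region sizes, strides, and offsets. The function name, the parameter names, and the row-major layout are illustrative assumptions only; the disclosure does not prescribe a particular layout or formula.

```python
def descriptor_address(base_address, dims, strides, offsets, position):
    """Map an N-dimensional data description position to a flat storage address.

    base_address: base address of the descriptor's data reference point
    dims:         size of the tensor's storage region in each dimension
    strides:      elements skipped per step in each dimension (layout of the
                  enclosing data storage space)
    offsets:      offset of the storage region inside the storage space
    position:     the data description position to translate
    """
    addr = base_address
    for size, stride, off, pos in zip(dims, strides, offsets, position):
        if not 0 <= pos < size:
            raise IndexError("position outside the tensor's storage region")
        # Each dimension contributes (offset + position) steps of its stride.
        addr += (off + pos) * stride
    return addr

# A 2x3 tensor stored inside a wider 4x8 storage space, starting at
# offset (1, 2) from the data reference point at address 1000.
addr = descriptor_address(
    base_address=1000,
    dims=(2, 3),
    strides=(8, 1),   # row-major layout of the 4x8 storage space
    offsets=(1, 2),
    position=(1, 1),
)
```

Under these assumptions, position (1, 1) resolves to 1000 + (1+1)*8 + (2+1)*1 = 1019.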
条款6、根据条款1所述的计算装置,其中所述主处理电路配置成:Clause 6. The computing device of clause 1, wherein the main processing circuit is configured to:
获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及obtaining the computing instruction and parsing the computing instruction to obtain the master instruction and the slave instruction; and
将所述从指令发送至所述从处理电路。The slave instruction is sent to the slave processing circuit.
条款7、根据条款1所述的计算装置,还包括控制电路,所述控制电路配置成:获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及Clause 7. The computing device according to Clause 1, further comprising a control circuit, the control circuit configured to: obtain the calculation instruction and parse the calculation instruction to obtain the master instruction and the slave instruction; as well as
将所述主指令发送至所述主处理电路并且将所述从指令发送至所述从处理电路。The master instruction is sent to the master processing circuit and the slave instruction is sent to the slave processing circuit.
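The flow described in Clauses 6 through 9 (a compute instruction parsed into a master instruction and a slave instruction, with flag bits marking the pre-processing and/or post-processing parts) can be sketched as follows. The field names, flag bit positions, and dictionary representation are purely hypothetical stand-ins for demonstration; the disclosure does not specify an instruction encoding.

```python
PRE_FLAG = 0b01   # hypothetical identification bit: master pre-processing present
POST_FLAG = 0b10  # hypothetical identification bit: master post-processing present

def parse_compute_instruction(instr):
    """Split a compute instruction (as a dict) into a master and a slave instruction."""
    flags = instr["flags"]
    master = {
        # Pre-/post-processing operations run on the master processing circuit.
        "pre": instr.get("pre_op") if flags & PRE_FLAG else None,
        "post": instr.get("post_op") if flags & POST_FLAG else None,
        "descriptor": instr["descriptor"],  # tensor shape/address descriptor
    }
    slave = {
        # The core arithmetic operation runs on the slave processing circuit.
        "op": instr["slave_op"],
        "descriptor": instr["descriptor"],
    }
    return master, slave

master, slave = parse_compute_instruction({
    "flags": PRE_FLAG | POST_FLAG,
    "pre_op": "convert_fp32_to_fp16",
    "post_op": "convert_fp16_to_fp32",
    "slave_op": "conv2d",
    "descriptor": {"shape": (2, 3), "base": 1000},
})
```

Note that the parsing itself may be done either by the master processing circuit (Clause 6) or by a separate control circuit (Clause 7); the sketch is agnostic to where it runs.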
条款8、根据条款1所述的计算装置,其中所述主指令包括用于标识所述前处理操作和/或所述后处理操作的标识位。Clause 8. The computing device of clause 1, wherein the host instruction includes an identification bit for identifying the preprocessing operation and/or the postprocessing operation.
条款9、根据条款1所述的计算装置,其中所述计算指令包括用于区分所述主指令中的所述前处理操作和所述后处理操作的预设位。Clause 9. The computing device of Clause 1, wherein the computing instruction includes a preset bit for distinguishing the preprocessing operation and the postprocessing operation in the main instruction.
条款10、根据条款1所述的计算装置，其中所述主处理电路包括用于执行所述主运算操作的数据处理单元，并且所述数据处理单元包括用于执行数据转换操作的数据转换电路和/或用于执行数据拼接操作的数据拼接电路。Clause 10. The computing device of clause 1, wherein the main processing circuit comprises a data processing unit for performing the main arithmetic operation, and the data processing unit comprises a data conversion circuit for performing a data conversion operation and/or a data splicing circuit for performing a data splicing operation.
条款11、根据条款10所述的计算装置,其中所述数据转换电路包括一个或多个转换器,用于实现计算数据在多种不同数据类型之间的转换。 Clause 11. The computing device of clause 10, wherein the data conversion circuit includes one or more converters for enabling conversion of computational data between a plurality of different data types.
条款12、根据条款10所述的计算装置，其中所述数据拼接电路配置成以预定的位长对计算数据进行拆分，并且将拆分后获得的多个数据块按照预定顺序进行拼接。Clause 12. The computing device of clause 10, wherein the data splicing circuit is configured to split the computational data at a predetermined bit length, and to splice the plurality of data blocks obtained after the splitting in a predetermined order.
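The splitting-and-splicing behaviour of Clause 12 can be illustrated with a minimal Python sketch: computational data is split into blocks of a fixed bit length and the blocks are reassembled in a predetermined order (here, reversed block order, an endianness-style rearrangement). The function name, the pure-integer representation, and the choice of order are assumptions for demonstration only.

```python
def split_and_splice(value, total_bits, block_bits, order=None):
    """Split an integer into block_bits-wide blocks, then splice the blocks
    back together in the given order (default: reversed block order)."""
    n_blocks = total_bits // block_bits
    mask = (1 << block_bits) - 1
    # Split: extract blocks from least-significant to most-significant.
    blocks = [(value >> (i * block_bits)) & mask for i in range(n_blocks)]
    if order is None:
        order = list(reversed(range(n_blocks)))
    # Splice: reassemble the blocks in the predetermined order.
    result = 0
    for dst, src in enumerate(order):
        result |= blocks[src] << (dst * block_bits)
    return result

# Byte-swap a 32-bit word by splitting it into four 8-bit blocks.
swapped = split_and_splice(0x11223344, total_bits=32, block_bits=8)
```

With the default reversed order, 0x11223344 splices to 0x44332211.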
条款13、根据条款1所述的计算装置,其中所述主处理电路包括一组或多组流水运算电路,所述每组流水运算电路形成一条运算流水线并且包括一个或多个运算器,其中当所述每组流水运算电路包括多个运算器时,所述多个运算器被连接并配置成根据所述主指令选择性地参与以执行所述主运算操作。Clause 13. The computing device of clause 1, wherein the main processing circuit comprises one or more sets of pipelined arithmetic circuits, each set of pipelined arithmetic circuits forming an arithmetic pipeline and comprising one or more operators, wherein when When each group of pipeline arithmetic circuits includes a plurality of arithmetic units, the plurality of arithmetic units are connected and configured to selectively participate in performing the main arithmetic operation according to the main instruction.
条款14、根据条款13所述的计算装置,其中所述主处理电路包括至少两条运算流水线,并且每条运算流水线包括以下中的一个或多个运算器或电路:Clause 14. The computing device of clause 13, wherein the main processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more operators or circuits of:
随机数处理电路、加减电路、减法电路、查表电路、参数配置电路、乘法器、除法器、池化器、比较器、求绝对值电路、逻辑运算器、位置索引电路或过滤器。Random number processing circuit, addition and subtraction circuit, subtraction circuit, look-up table circuit, parameter configuration circuit, multiplier, divider, pooler, comparator, absolute value circuit, logic operator, position index circuit or filter.
条款15、根据条款1所述的计算装置，其中所述从处理电路包括用于执行所述从运算操作的多个运算电路，并且所述多个运算电路被连接并配置成执行多级流水的运算操作，其中所述运算电路包括乘法电路、比较电路、累加电路和转数电路中的一个或多个，以至少执行向量运算。Clause 15. The computing device of clause 1, wherein the slave processing circuit includes a plurality of arithmetic circuits for performing the slave arithmetic operations, and the plurality of arithmetic circuits are connected and configured to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number conversion circuit, so as to perform at least vector operations.
条款16、根据条款15所述的计算装置,其中所述从指令包括对经所述前处理操作的计算数据执行卷积运算的卷积指令,所述从处理电路配置成:Clause 16. The computing device of clause 15, wherein the slave instruction comprises a convolution instruction that performs a convolution operation on the computed data subjected to the preprocessing operation, the slave processing circuit configured to:
根据所述卷积指令对经所述前处理操作的计算数据执行卷积运算。A convolution operation is performed on the calculated data subjected to the preprocessing operation according to the convolution instruction.
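The flow of Clause 16 (the master circuit applies a pre-processing operation, after which the slave circuit runs a convolution on the pre-processed data) can be sketched as below. The two functions are pure-Python stand-ins: the pre-processing step is shown as a simple data conversion (rounding to integers), and the convolution as a valid-mode 1-D multiply-accumulate, since the disclosure describes the circuits only abstractly.

```python
def master_preprocess(data):
    """Stand-in for a data conversion on the master circuit (round to int)."""
    return [int(round(x)) for x in data]

def slave_convolve(data, kernel):
    """Valid-mode 1-D convolution (multiply-accumulate) on the slave circuit."""
    k = len(kernel)
    return [
        sum(data[i + j] * kernel[j] for j in range(k))
        for i in range(len(data) - k + 1)
    ]

# The slave circuit operates on data already pre-processed by the master circuit.
out = slave_convolve(master_preprocess([1.1, 1.9, 3.0, 4.2]), kernel=[1, -1])
```

Here the pre-processing yields [1, 2, 3, 4], and the differencing kernel then produces [-1, -1, -1].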
条款17、一种集成电路芯片,包括根据条款1-16的任意一项所述的计算装置。Clause 17. An integrated circuit chip comprising the computing device of any of clauses 1-16.
条款18、一种板卡,包括根据条款17所述的集成电路芯片。Clause 18. A board comprising the integrated circuit chip of clause 17.
条款19、一种电子设备,包括根据条款17所述的集成电路芯片。Clause 19. An electronic device comprising the integrated circuit chip of clause 17.
条款20、一种使用计算装置来执行计算操作的方法,其中所述计算装置包括主处理电路和至少一个从处理电路,所述方法包括: Clause 20. A method of performing computing operations using a computing device, wherein the computing device includes a master processing circuit and at least one slave processing circuit, the method comprising:
将所述主处理电路配置成响应于主指令来执行主运算操作,configuring the main processing circuit to perform main arithmetic operations in response to a main instruction,
将所述从处理电路配置成响应于从指令来执行从运算操作,configuring the slave processing circuit to perform a slave arithmetic operation in response to a slave instruction,
其中所述主运算操作包括针对于所述从运算操作的前处理操作和/或后处理操作，所述主指令和所述从指令根据所述计算装置接收的计算指令解析得到，其中所述计算指令的操作数包括用于指示张量的形状的描述符，所述描述符用于确定所述操作数对应数据的存储地址，wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, the master instruction and the slave instruction are obtained by parsing a computing instruction received by the computing device, an operand of the computing instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine a storage address of the data corresponding to the operand,
其中所述方法还包括将所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。The method further includes configuring the master processing circuit and/or the slave processing circuit to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage address.
条款21、根据条款20所述的方法,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。 Clause 21. The method of clause 20, wherein the computation instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
条款22、根据条款21所述的方法,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。 Clause 22. The method of clause 21, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
条款23、根据条款22所述的方法,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。 Clause 23. The method of clause 22, wherein the address parameter of the tensor data comprises a base address of a data base point of the descriptor in the data storage space of the tensor data.
条款24、根据条款23所述的方法,其中所述张量数据的形状参数包括以下至少一种: Clause 24. The method of clause 23, wherein the shape parameter of the tensor data comprises at least one of the following:
所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系，其中N为大于或等于零的整数。a size of the data storage space in at least one of N dimension directions, a size of the storage region of the tensor data in at least one of the N dimension directions, an offset of the storage region in at least one of the N dimension directions, positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions, and a mapping relationship between data description positions and data addresses of the tensor data indicated by the descriptor, where N is an integer greater than or equal to zero.
条款25、根据条款20所述的方法,其中将所述主处理电路配置成: Clause 25. The method of clause 20, wherein the main processing circuit is configured to:
获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及obtaining the computing instruction and parsing the computing instruction to obtain the master instruction and the slave instruction; and
将所述从指令发送至所述从处理电路。The slave instruction is sent to the slave processing circuit.
条款26、根据条款20所述的方法,其中计算装置包括控制电路,所述方法还包括将控制电路配置成:Clause 26. The method of clause 20, wherein the computing device includes a control circuit, the method further comprising configuring the control circuit to:
获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及obtaining the computing instruction and parsing the computing instruction to obtain the master instruction and the slave instruction; and
将所述主指令发送至所述主处理电路并且将所述从指令发送至所述从处理电路。The master instruction is sent to the master processing circuit and the slave instruction is sent to the slave processing circuit.
条款27、根据条款20所述的方法,其中所述主指令包括用于标识所述前处理操作和/或所述后处理操作的标识位。Clause 27. The method of clause 20, wherein the host instruction includes an identification bit for identifying the preprocessing operation and/or the postprocessing operation.
条款28、根据条款20所述的方法,其中所述计算指令包括用于区分所述主指令中的所述前处理操作和所述后处理操作的预设位。Clause 28. The method of clause 20, wherein the computing instruction includes preset bits for distinguishing between the preprocessing operations and the postprocessing operations in the host instruction.
条款29、根据条款20所述的方法，其中所述主处理电路包括数据处理单元，并且所述数据处理单元包括数据转换电路和/或数据拼接电路，所述方法包括将数据处理单元配置成执行所述主运算操作，并且将所述数据转换电路配置成执行数据转换操作，以及将所述数据拼接电路配置成执行数据拼接操作。Clause 29. The method of clause 20, wherein the main processing circuit includes a data processing unit, and the data processing unit includes a data conversion circuit and/or a data splicing circuit, the method including configuring the data processing unit to perform the main arithmetic operation, configuring the data conversion circuit to perform a data conversion operation, and configuring the data splicing circuit to perform a data splicing operation.
条款30、根据条款29所述的方法，其中所述数据转换电路包括一个或多个转换器，所述方法包括将一个或多个转换器配置成实现计算数据在多种不同数据类型之间的转换。Clause 30. The method of clause 29, wherein the data conversion circuit includes one or more converters, the method including configuring the one or more converters to convert computational data between a plurality of different data types.
条款31、根据条款29所述的方法，其中将所述数据拼接电路配置成以预定的位长对计算数据进行拆分，并且将拆分后获得的多个数据块按照预定顺序进行拼接。Clause 31. The method of clause 29, wherein the data splicing circuit is configured to split the computational data at a predetermined bit length, and to splice the plurality of data blocks obtained after the splitting in a predetermined order.
条款32、根据条款20所述的方法，其中所述主处理电路包括一组或多组流水运算电路，所述每组流水运算电路形成一条运算流水线并且包括一个或多个运算器，其中当所述每组流水运算电路包括多个运算器时，所述方法包括将所述多个运算器进行连接并且配置成根据所述主指令选择性地参与以执行所述主运算操作。Clause 32. The method of clause 20, wherein the main processing circuit includes one or more sets of pipelined arithmetic circuits, each set of pipelined arithmetic circuits forming an arithmetic pipeline and including one or more operators, wherein, when each set of pipelined arithmetic circuits includes a plurality of operators, the method includes connecting the plurality of operators and configuring them to selectively participate in performing the main arithmetic operation according to the main instruction.
条款33、根据条款32所述的方法,其中所述主处理电路包括至少两条运算流水线,并且每条运算流水线包括以下中的一个或多个运算器或电路:Clause 33. The method of clause 32, wherein the main processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more operators or circuits of:
随机数处理电路、加减电路、减法电路、查表电路、参数配置电路、乘法器、除法器、池化器、比较器、求绝对值电路、逻辑运算器、位置索引电路或过滤器。Random number processing circuit, addition and subtraction circuit, subtraction circuit, look-up table circuit, parameter configuration circuit, multiplier, divider, pooler, comparator, absolute value circuit, logic operator, position index circuit or filter.
条款34、根据条款20所述的方法，其中所述从处理电路包括多个运算电路，所述方法包括将所述多个运算电路配置成执行所述从运算操作，并且所述方法还包括将所述多个运算电路连接并配置成执行多级流水的运算操作，其中所述运算电路包括乘法电路、比较电路、累加电路和转数电路中的一个或多个，以至少执行向量运算。Clause 34. The method of clause 20, wherein the slave processing circuit includes a plurality of arithmetic circuits, the method including configuring the plurality of arithmetic circuits to perform the slave arithmetic operations, and further including connecting and configuring the plurality of arithmetic circuits to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number conversion circuit, so as to perform at least vector operations.
条款35、根据条款34所述的方法,其中所述从指令包括对经所述前处理操作的计算数据执行卷积运算的卷积指令,所述方法包括将所述从处理电路配置成:Clause 35. The method of clause 34, wherein the slave instruction comprises a convolution instruction that performs a convolution operation on computational data subjected to the preprocessing operation, the method comprising configuring the slave processing circuit to:
根据所述卷积指令对经所述前处理操作的计算数据执行卷积运算。A convolution operation is performed on the calculated data subjected to the preprocessing operation according to the convolution instruction.
虽然本文已经示出和描述了本披露的多个实施例，但对于本领域技术人员显而易见的是，这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中，可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围，并因此覆盖这些权利要求范围内的等同或替代方案。While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions may occur to those skilled in the art without departing from the ideas and spirit of this disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of protection of the disclosure and therefore to cover the equivalents and alternatives within the scope of these claims.

Claims (35)

  1. 一种计算装置,包括主处理电路和至少一个从处理电路,其中:A computing device comprising a master processing circuit and at least one slave processing circuit, wherein:
    所述主处理电路配置成响应于主指令来执行主运算操作,the main processing circuit is configured to perform main arithmetic operations in response to a main instruction,
    所述从处理电路配置成响应于从指令来执行从运算操作,the slave processing circuit is configured to perform a slave arithmetic operation in response to a slave instruction,
    其中，所述主运算操作包括针对于所述从运算操作的前处理操作和/或后处理操作，所述主指令和所述从指令根据所述计算装置接收的计算指令解析得到，其中所述计算指令的操作数包括用于指示张量的形状的描述符，所述描述符用于确定所述操作数对应数据的存储地址，wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, the master instruction and the slave instruction are obtained by parsing a computing instruction received by the computing device, an operand of the computing instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine a storage address of the data corresponding to the operand,
    其中所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。The master processing circuit and/or the slave processing circuit are configured to perform respective corresponding master arithmetic operations and/or slave processing operations according to the storage addresses.
  2. 根据权利要求1所述的计算装置,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。The computing device of claim 1, wherein the computing instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
  3. 根据权利要求2所述的计算装置,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。2. The computing device of claim 2, wherein the content of the descriptor further includes at least one address parameter representing an address of tensor data.
  4. 根据权利要求3所述的计算装置,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。The computing device of claim 3, wherein the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
  5. 根据权利要求4所述的计算装置,其中所述张量数据的形状参数包括以下至少一种:The computing device of claim 4, wherein the shape parameter of the tensor data includes at least one of the following:
    所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系，其中N为大于或等于零的整数。a size of the data storage space in at least one of N dimension directions, a size of the storage region of the tensor data in at least one of the N dimension directions, an offset of the storage region in at least one of the N dimension directions, positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions, and a mapping relationship between data description positions and data addresses of the tensor data indicated by the descriptor, where N is an integer greater than or equal to zero.
  6. 根据权利要求1所述的计算装置,其中所述主处理电路配置成:The computing device of claim 1, wherein the main processing circuit is configured to:
    获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及obtaining the computing instruction and parsing the computing instruction to obtain the master instruction and the slave instruction; and
    将所述从指令发送至所述从处理电路。The slave instruction is sent to the slave processing circuit.
  7. 根据权利要求1所述的计算装置,还包括控制电路,所述控制电路配置成:The computing device of claim 1, further comprising a control circuit configured to:
    获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及obtaining the computing instruction and parsing the computing instruction to obtain the master instruction and the slave instruction; and
    将所述主指令发送至所述主处理电路并且将所述从指令发送至所述从处理电路。The master instruction is sent to the master processing circuit and the slave instruction is sent to the slave processing circuit.
  8. 根据权利要求1所述的计算装置,其中所述主指令包括用于标识所述前处理操作和/或所述后处理操作的标识位。The computing device of claim 1, wherein the host instruction includes an identification bit for identifying the preprocessing operation and/or the postprocessing operation.
  9. 根据权利要求1所述的计算装置,其中所述计算指令包括用于区分所述主指令中的所述前处理操作和所述后处理操作的预设位。The computing device of claim 1, wherein the computing instruction includes a preset bit for distinguishing the pre-processing operation and the post-processing operation in the main instruction.
  10. The computing apparatus of claim 1, wherein the master processing circuit comprises a data processing unit for performing the master arithmetic operation, and the data processing unit comprises a data conversion circuit for performing data conversion operations and/or a data splicing circuit for performing data splicing operations.
  11. The computing apparatus of claim 10, wherein the data conversion circuit comprises one or more converters for converting computational data among a plurality of different data types.
  12. The computing apparatus of claim 10, wherein the data splicing circuit is configured to split computational data by a predetermined bit length and to splice, in a predetermined order, the plurality of data blocks obtained from the splitting.
  13. The computing apparatus of claim 1, wherein the master processing circuit comprises one or more groups of pipelined arithmetic circuits, each group of pipelined arithmetic circuits forming an arithmetic pipeline and comprising one or more operators, wherein when a group of pipelined arithmetic circuits comprises a plurality of operators, the plurality of operators are connected and configured to participate selectively, according to the master instruction, in performing the master arithmetic operation.
  14. The computing apparatus of claim 13, wherein the master processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more of the following operators or circuits:
    a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logical operator, a position index circuit, or a filter.
  15. The computing apparatus of claim 1, wherein the slave processing circuit comprises a plurality of arithmetic circuits for performing the slave arithmetic operation, the plurality of arithmetic circuits being connected and configured to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits comprise one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a rotation number circuit, so as to perform at least vector operations.
  16. The computing apparatus of claim 15, wherein the slave instruction comprises a convolution instruction for performing a convolution operation on computational data subjected to the pre-processing operation, the slave processing circuit being configured to:
    perform, according to the convolution instruction, a convolution operation on the computational data subjected to the pre-processing operation.
  17. An integrated circuit chip comprising the computing apparatus according to any one of claims 1-16.
  18. A board card comprising the integrated circuit chip according to claim 17.
  19. An electronic device comprising the integrated circuit chip according to claim 17.
  20. A method of performing computing operations using a computing apparatus, wherein the computing apparatus comprises a master processing circuit and at least one slave processing circuit, the method comprising:
    configuring the master processing circuit to perform a master arithmetic operation in response to a master instruction,
    configuring the slave processing circuit to perform a slave arithmetic operation in response to a slave instruction,
    wherein the master arithmetic operation comprises a pre-processing operation and/or a post-processing operation for the slave arithmetic operation, the master instruction and the slave instruction being obtained by parsing a computing instruction received by the computing apparatus, wherein an operand of the computing instruction comprises a descriptor for indicating a shape of a tensor, the descriptor being used to determine a storage address of data corresponding to the operand,
    wherein the method further comprises configuring the master processing circuit and/or the slave processing circuit to perform the respective master arithmetic operation and/or slave processing operation according to the storage address.
  21. The method of claim 20, wherein the computing instruction comprises an identification of a descriptor and/or contents of the descriptor, the contents of the descriptor comprising at least one shape parameter representing the shape of the tensor data.
  22. The method of claim 21, wherein the contents of the descriptor further comprise at least one address parameter representing an address of the tensor data.
  23. The method of claim 22, wherein the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
  24. The method of claim 23, wherein the shape parameter of the tensor data comprises at least one of the following:
    the size of the data storage space in at least one of the N dimension directions, the size of the storage region of the tensor data in at least one of the N dimension directions, the offset of the storage region in at least one of the N dimension directions, the positions, relative to the data reference point, of at least two vertices at diagonal corners of the N dimension directions, and a mapping relationship between data description positions of the tensor data indicated by the descriptor and data addresses, where N is an integer greater than or equal to zero.
  25. The method of claim 20, wherein the master processing circuit is configured to:
    acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and
    send the slave instruction to the slave processing circuit.
  26. The method of claim 20, wherein the computing apparatus comprises a control circuit, the method further comprising configuring the control circuit to:
    acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and
    send the master instruction to the master processing circuit and send the slave instruction to the slave processing circuit.
  27. The method of claim 20, wherein the master instruction comprises an identification bit for identifying the pre-processing operation and/or the post-processing operation.
  28. The method of claim 20, wherein the computing instruction comprises a preset bit for distinguishing the pre-processing operation from the post-processing operation in the master instruction.
  29. The method of claim 20, wherein the master processing circuit comprises a data processing unit, and the data processing unit comprises a data conversion circuit and/or a data splicing circuit, the method comprising configuring the data processing unit to perform the master arithmetic operation, configuring the data conversion circuit to perform data conversion operations, and configuring the data splicing circuit to perform data splicing operations.
  30. The method of claim 29, wherein the data conversion circuit comprises one or more converters, the method comprising configuring the one or more converters to convert computational data among a plurality of different data types.
  31. The method of claim 29, wherein the data splicing circuit is configured to split computational data by a predetermined bit length and to splice, in a predetermined order, the plurality of data blocks obtained from the splitting.
  32. The method of claim 20, wherein the master processing circuit comprises one or more groups of pipelined arithmetic circuits, each group of pipelined arithmetic circuits forming an arithmetic pipeline and comprising one or more operators, wherein when a group of pipelined arithmetic circuits comprises a plurality of operators, the method comprises connecting the plurality of operators and configuring them to participate selectively, according to the master instruction, in performing the master arithmetic operation.
  33. The method of claim 32, wherein the master processing circuit comprises at least two arithmetic pipelines, and each arithmetic pipeline comprises one or more of the following operators or circuits:
    a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logical operator, a position index circuit, or a filter.
  34. The method of claim 20, wherein the slave processing circuit comprises a plurality of arithmetic circuits, the method comprising configuring the plurality of arithmetic circuits to perform the slave arithmetic operation, and the method further comprising connecting and configuring the plurality of arithmetic circuits to perform multi-stage pipelined arithmetic operations, wherein the arithmetic circuits comprise one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a rotation number circuit, so as to perform at least vector operations.
  35. The method of claim 34, wherein the slave instruction comprises a convolution instruction for performing a convolution operation on computational data subjected to the pre-processing operation, the method comprising configuring the slave processing circuit to:
    perform, according to the convolution instruction, a convolution operation on the computational data subjected to the pre-processing operation.
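As a rough illustration only of the descriptor scheme recited in claims 20-24 (this sketch is not part of the application; all names such as `TensorDescriptor` and `address_of`, the row-major layout, and the 4-byte element size are hypothetical assumptions), a base address, per-dimension storage-space sizes, and per-dimension region offsets suffice to map a data description position to a data address:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorDescriptor:
    # Hypothetical names: the claims recite only the parameters themselves.
    base_address: int          # reference address of the data reference point (claim 23)
    space_dims: List[int]      # data storage space size per dimension direction (claim 24)
    region_offsets: List[int]  # storage region offset per dimension direction (claim 24)
    elem_size: int = 4         # bytes per element (assumed, not recited in the claims)

    def address_of(self, index: List[int]) -> int:
        """Map a data description position to a data address (claim 24 mapping)."""
        linear = 0
        # Row-major linearization over the data storage space: each region
        # coordinate is shifted by its offset before being folded in.
        for dim, idx, off in zip(self.space_dims, index, self.region_offsets):
            linear = linear * dim + (idx + off)
        return self.base_address + linear * self.elem_size


# Element (1, 3) of a region offset by (2, 4) inside an 8x16 storage space.
desc = TensorDescriptor(base_address=0x1000, space_dims=[8, 16], region_offsets=[2, 4])
addr = desc.address_of([1, 3])
```

Under these assumptions the master and slave processing circuits would only need the descriptor's parameters, not the full tensor, to resolve the storage address of each operand.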
PCT/CN2021/095705 2020-06-30 2021-05-25 Computing apparatus, integrated circuit chip, board card, electronic device, and computing method WO2022001500A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010619460.8A CN113867800A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN202010619460.8 2020-06-30

Publications (1)

Publication Number Publication Date
WO2022001500A1 true WO2022001500A1 (en) 2022-01-06

Family

ID=78981783

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095705 WO2022001500A1 (en) 2020-06-30 2021-05-25 Computing apparatus, integrated circuit chip, board card, electronic device, and computing method

Country Status (2)

Country Link
CN (2) CN118012505A (en)
WO (1) WO2022001500A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020890A (en) * 2012-12-17 2013-04-03 中国科学院半导体研究所 Visual processing device based on multi-layer parallel processing
CN107729990A (en) * 2017-07-20 2018-02-23 上海寒武纪信息科技有限公司 Support the device and method for being used to perform artificial neural network forward operation that discrete data represents
CN111047005A (en) * 2018-10-11 2020-04-21 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111078286A (en) * 2018-10-19 2020-04-28 上海寒武纪信息科技有限公司 Data communication method, computing system and storage medium
US20200201932A1 (en) * 2019-12-28 2020-06-25 Intel Corporation Apparatuses, methods, and systems for instructions of a matrix operations accelerator

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8838997B2 (en) * 2012-09-28 2014-09-16 Intel Corporation Instruction set for message scheduling of SHA256 algorithm
US9785565B2 (en) * 2014-06-30 2017-10-10 Microunity Systems Engineering, Inc. System and methods for expandably wide processor instructions
US10762164B2 (en) * 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
US10628295B2 (en) * 2017-12-26 2020-04-21 Samsung Electronics Co., Ltd. Computing mechanisms using lookup tables stored on memory
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111061507A (en) * 2018-10-16 2020-04-24 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111079917B (en) * 2018-10-22 2023-08-11 北京地平线机器人技术研发有限公司 Tensor data block access method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
CN115599442B (en) * 2022-12-14 2023-03-10 成都登临科技有限公司 AI chip, electronic equipment and tensor processing method

Also Published As

Publication number Publication date
CN113867800A (en) 2021-12-31
CN118012505A (en) 2024-05-10

Similar Documents

Publication Publication Date Title
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
CN109522052B (en) Computing device and board card
CN109543832B (en) Computing device and board card
CN110096310B (en) Operation method, operation device, computer equipment and storage medium
TW201935265A (en) Computing device and method
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
WO2022134873A1 (en) Data processing device, data processing method, and related product
CN109711540B (en) Computing device and board card
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022001497A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN109740730B (en) Operation method, device and related product
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
WO2022001496A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN114692844A (en) Data processing device, data processing method and related product
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
JP7368512B2 (en) Computing equipment, integrated circuit chips, board cards, electronic devices and computing methods
CN112395009A (en) Operation method, operation device, computer equipment and storage medium
WO2022001499A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022001498A1 (en) Computing apparatus, integrated circuit chip, board, electronic device and computing method
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
CN111290788B (en) Operation method, operation device, computer equipment and storage medium
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN113792867B (en) Arithmetic circuit, chip and board card
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111124497B (en) Operation method, operation device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21831623

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21831623

Country of ref document: EP

Kind code of ref document: A1