CN114692841A - Data processing device, data processing method and related product - Google Patents

Data processing device, data processing method and related product

Info

Publication number: CN114692841A
Application number: CN202011563257.XA
Authority: CN (China)
Prior art keywords: data, sparse, instruction, thinned, tensor
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: not disclosed (不公告发明人)
Current assignee: Cambricon Technologies Corp Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Cambricon Technologies Corp Ltd
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202011563257.XA
Priority to PCT/CN2021/128189 (WO2022134873A1)
Publication of CN114692841A
Legal status: Pending

Classifications

    • G06N 3/063 — Computing arrangements based on biological models; neural networks; physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F 12/0646 — Accessing, addressing or allocating within memory systems or architectures; addressing a physical block of locations; configuration or reconfiguration
    • G06F 9/30021 — Arrangements for executing machine instructions; compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F 9/38 — Concurrent instruction execution, e.g. pipeline or look-ahead
    • G06N 20/00 — Machine learning
    • G06N 3/045 — Neural network architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/084 — Neural network learning methods; backpropagation, e.g. using gradient descent


Abstract

The present disclosure discloses a data processing apparatus, a data processing method, and a related product. The data processing apparatus may be implemented as a computing apparatus included in a combined processing apparatus, which may also include an interface apparatus and other processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete computing operations specified by a user. The combined processing apparatus may further include a storage apparatus connected to the computing apparatus and the other processing apparatus, respectively, for storing data of the computing apparatus and the other processing apparatus. Aspects of the present disclosure provide specialized instructions for operations related to structured sparsity of tensor data, which can simplify processing and improve the processing efficiency of the machine.

Description

Data processing device, data processing method and related product
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board.
Background
In recent years, with the rapid development of deep learning, algorithm performance in a series of fields such as computer vision and natural language processing has advanced by leaps and bounds. However, deep learning algorithms are both computation-intensive and storage-intensive. As information processing tasks grow more complex and the demands on the real-time performance and accuracy of algorithms keep rising, neural networks are designed ever deeper, so their requirements on computation and storage space keep increasing, and existing deep-learning-based artificial intelligence technology is difficult to apply directly on mobile phones, satellites, or embedded devices with limited hardware resources.
Therefore, compression, acceleration, and optimization of deep neural network models become very important. A large body of research attempts to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for engineering applications of deep learning technology on embedded and mobile ends. Sparsification is one such model lightweighting method.
Network parameter sparsification reduces redundant components in a larger network by appropriate methods, so as to lower the network's requirements on computation and storage space. However, existing hardware and/or instruction sets cannot support sparsification efficiently.
Disclosure of Invention
In order to at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing apparatus, a data processing method, a chip, and a board.
In a first aspect, the present disclosure discloses a data processing apparatus comprising: control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of: shape information of tensor data and spatial information of tensor data; a tensor interface circuit configured to parse the descriptors; a storage circuit configured to store pre-sparsification and/or post-sparsification information; and an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any of the embodiments of the first aspect.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a data processing method, the method comprising: parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data; parsing the descriptor; reading a corresponding operand based at least in part on the parsed descriptor; performing the structured sparsely related operation on the operand; and outputting the operation result.
With the data processing apparatus, data processing method, integrated circuit chip, and board card provided above, embodiments of the present disclosure provide a sparse instruction for performing operations related to structured sparsity of tensor data, where the tensor data is described by descriptors. In some embodiments, the sparse instruction may include an operating mode bit to indicate different operating modes of the sparse instruction, so as to perform different operations. In other embodiments, multiple sparse instructions may be provided, each corresponding to one or more different operating modes, to perform the various operations related to structured sparsity. By providing specialized sparse instructions to perform operations related to structured sparsity of tensor data, processing can be simplified, thereby improving the processing efficiency of the machine.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a multi-core computing device according to an embodiment of the disclosure;
FIG. 5 is an internal block diagram illustrating a processor core of an embodiment of the disclosure;
FIG. 6 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram showing the structure of a data processing apparatus of an embodiment of the present disclosure;
FIG. 9A is a schematic diagram illustrating an exemplary pipelined arithmetic circuit for structured sparsity processing according to an embodiment of the present disclosure;
FIG. 9B is a schematic diagram illustrating an exemplary pipelined arithmetic circuit for structured sparsity processing according to another embodiment of the present disclosure; and
fig. 10 is an exemplary flowchart illustrating a data processing method of an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects, and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing demands of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the field of cloud intelligence; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (micro controller unit, MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure alone may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, such as DDR memory, typically 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing apparatus 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the single-core computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access (DMA) unit 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure: it is a system on chip that includes at least one cluster, and each cluster includes a plurality of processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip / cluster / processor-core hierarchy.
In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple external storage controllers 401 (two are shown by way of example in the figure) for accessing an external storage device, such as the DRAM 204 in FIG. 2, in order to read data from or write data to off-chip memory in response to an access request issued by a processor core. The peripheral communication module 402 is used to receive a control signal from the processing device 203 through the interface device 202 and start the computing device 201 to execute a task. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and the plurality of clusters 405, and is used to transmit data and control signals between the respective modules. The synchronization module 404 is a global synchronization barrier controller (GBC) for coordinating the work progress of the clusters and ensuring synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are shown by way of example in the figure, and as hardware advances, the multi-core computing device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 4, each cluster 405 includes a plurality of processor cores (IPU cores) 406 and a storage core (MEM core) 407.
Four processor cores 406 are shown by way of example in the figure; the present disclosure does not limit the number of processor cores 406. The internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3, and likewise includes three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control memory access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the storage core 407 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, as well as performing communication between the cluster 405 and the DRAM 204, communication between clusters 405, communication between processor cores 406, and the like. In other embodiments, the storage core 407 has scalar operation capability, used to perform scalar operations.
The storage core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 through the SRAM 408. The storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used respectively to perform communication among the processor cores 406, communication between clusters 405, and data transmission between a cluster 405 and the DRAM 204. These are described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
CDMA 410 is used to control access to SRAM 408 between different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external storage controller 401 to control memory access from the SRAM 408 of a cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be achieved via two channels. The first channel is to connect the DRAM 204 directly with the NRAM 531 or WRAM 532 through the IODMA 533. The second channel is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411 first, and then transfer it between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second channel seemingly requires more components and a longer data flow, in some embodiments the bandwidth of the second channel is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient over the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.
In other embodiments, the functions of the GDMA 411 and the IODMA 533 may be integrated in the same component. For convenience of description, the present disclosure treats the GDMA 411 and the IODMA 533 as different components; as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, such variants by a person skilled in the art fall within the scope of protection of the present disclosure. Further, the functions of the GDMA 411, the IODMA 533, the CDMA 410, and the MVDMA 534 may also be implemented by the same component.
The instructions of conventional processors are designed to perform basic single-data scalar operations. Here, a single-data scalar operation means that each operand of the instruction is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional data (i.e., tensor data), and using only scalar operations cannot make hardware complete the operation task efficiently. Therefore, how to efficiently execute multidimensional tensor data processing is also a problem to be solved urgently in the current computing field.
In an embodiment of the present disclosure, a structured sparse instruction is provided for performing an operation related to structured sparseness of tensor data. At least one descriptor is included in at least one operand of the structured sparse instruction, by which information related to tensor data can be obtained. In particular, the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data. Shape information of the tensor data can be used to determine the data address of the tensor data corresponding to the operand in the data storage space. The spatial information of the tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the order of execution of the instructions.
In one possible implementation, the spatial information of the tensor data may be indicated by a space identification (ID). The space ID may also be referred to as a space alias; it refers to the spatial region used to store the corresponding tensor data, which may be a contiguous space or multiple segments of space. Different space IDs indicate that the spatial regions pointed to have no dependency on each other.
Various possible implementations of shape information for tensor data are described in detail below in conjunction with the figures.
Tensors may contain many forms of data composition. Tensors may be of different dimensions: for example, a scalar may be regarded as a 0-dimensional tensor, a vector may be regarded as a 1-dimensional tensor, and a matrix may be a tensor of 2 or more dimensions. The shape of a tensor includes information such as the dimensions of the tensor and the size of each dimension. For example, for the three-dimensional tensor:
x3 = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
the shape or dimensions of the tensor can be expressed as X3 = (2, 2, 3); that is, the tensor is expressed as a three-dimensional tensor by three parameters: its size in the first dimension is 2, its size in the second dimension is 2, and its size in the third dimension is 3. When tensor data is stored in memory, its shape cannot be determined from the data address (or storage area), and related information such as the interrelation among multiple pieces of tensor data cannot be determined either, resulting in low efficiency of processor access to tensor data.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, where N is a positive integer, e.g., N = 1, 2, or 3, or zero. The three-dimensional tensor in the above example can be represented by the descriptor (2, 2, 3). It should be noted that the present disclosure does not limit the way the descriptor indicates the tensor shape.
In one possible implementation, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
Although tensor data can be multidimensional, because the layout of memory is always one-dimensional, there is a correspondence between a tensor and its storage in memory. Tensor data is typically allocated in a contiguous memory space, i.e., the tensor data can be one-dimensionally expanded (e.g., row-first) and stored in memory.
This relationship between the tensor and the underlying storage may be represented by the offset of a dimension (offset), the size of a dimension (size), the step size of a dimension (stride), and so on. The offset of a dimension refers to the offset in that dimension from a reference position. The size of a dimension refers to the size of that dimension, i.e., the number of elements in the dimension. The step size of a dimension refers to the interval between adjacent elements in that dimension; for example, the step sizes of the above three-dimensional tensor are (6, 3, 1), that is, the step size of the first dimension is 6, the step size of the second dimension is 3, and the step size of the third dimension is 1.
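For illustration only (NumPy and the variable names below are assumptions of this sketch, not part of the patent), the shape, step sizes, and one-dimensional expansion of the example tensor can be checked as follows:

    import numpy as np

    # The example three-dimensional tensor with shape (2, 2, 3).
    x3 = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[7, 8, 9], [10, 11, 12]]])

    print(x3.shape)  # (2, 2, 3)
    # Per-dimension step sizes counted in elements, matching (6, 3, 1) above.
    print(tuple(s // x3.itemsize for s in x3.strides))  # (6, 3, 1)
    # One-dimensional, row-first expansion as laid out in memory.
    print(x3.ravel())  # [ 1  2  3  4  5  6  7  8  9 10 11 12]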
FIG. 6 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in fig. 6, the data storage space 61 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), the size in the Y-axis direction (the total number of rows) is ori_y (not shown), and the start address PA_start (base address) of the data storage space 61 is the physical address of the first data block 62. The data block 63 is part of the data in the data storage space 61; its offset 65 in the X-axis direction is denoted offset_x, its offset 64 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the data block 63 is defined using a descriptor, the data reference point of the descriptor may be the first data block of the data storage space 61, and the reference address of the descriptor may be agreed to be the start address PA_start of the data storage space 61. The content of the descriptor of the data block 63 may then be determined from the size ori_x of the data storage space 61 in the X axis and its size ori_y in the Y axis, together with the offset offset_y of the data block 63 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following formula (1):
descriptor content = {ori_x, ori_y, offset_x, offset_y, size_x, size_y}        (1)
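As a minimal sketch (the Python class and field names are our assumptions, not an encoding prescribed by this disclosure), the descriptor content of formula (1) could be modeled as:

    from dataclasses import dataclass

    @dataclass
    class DescriptorContent:
        ori_x: int     # size of the data storage space in the X direction
        ori_y: int     # size of the data storage space in the Y direction
        offset_x: int  # offset of the data block in the X direction
        offset_y: int  # offset of the data block in the Y direction
        size_x: int    # size of the data block in the X direction
        size_y: int    # size of the data block in the Y direction

    # Descriptor for a data block like block 63 of FIG. 6, with illustrative values.
    d63 = DescriptorContent(ori_x=10, ori_y=8, offset_x=3, offset_y=2, size_x=4, size_y=3)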
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation, a reference address of the data reference point of the descriptor in the data storage space may be agreed upon, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions, relative to the data reference point, of at least two vertices located at diagonal positions in the N dimensional directions.
For example, a reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For example, one datum (for example, the datum at position (2, 2)) may be selected as the data reference point in the data storage space 61, and the physical address of this datum in the data storage space may be used as the reference address PA_base. The content of the descriptor of the data block 63 in fig. 6 can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 63 relative to the data reference point are determined, for example using the diagonal vertices in the top-left to bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 63 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following equation (2):
descriptor content = {PA_base, x_min, x_max, y_min, y_max}        (2)
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the data storage space, together with the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor. For example, when the tensor data indicated by the descriptor is three-dimensional spatial data, the mapping relationship between the data description position and the data address may be defined by a function f(x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (3):
descriptor content = {PA_base, f(x, y, z)}        (3)
in one possible implementation, the descriptor is further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (4):
descriptor content = {ori_x, ori_y, offset_x, offset_y, size_x, size_y, PA}        (4)
where PA is the address parameter. The address parameter may be a logical address or a physical address. When parsing the descriptor, PA may be taken as any one of a vertex, the middle point, or a preset point of the tensor shape, and the corresponding data address can be obtained by combining it with the shape parameters in the X direction and the Y direction.
In one possible implementation, the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address comprises a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be the following equation (5):
descriptor content = {ori_x, ori_y, offset_x, offset_y, size_x, size_y, PA_start}        (5)
where PA_start is a reference address parameter, which is not described again.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In one possible implementation, a default base address may be set for a task; the descriptors in the instructions of the task use this base address, and the descriptor content may include shape parameters based on this base address. This base address may be determined by setting an environment parameter for the task. For the related description and usage of the base address, see the above embodiments. In this implementation, the content of the descriptor can be mapped to data addresses more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the mode of setting a common reference address using an environment parameter, this mode allows each descriptor to describe data more flexibly and to use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure is not limited to a particular calculation method of the data address.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x × size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (6):
PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x        (6)
From the data start address PA1(x,y) determined by the above equation (6), combined with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage area, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation, when the operand further includes a data description position for the descriptor, the data address in the data storage space of the data corresponding to the operand may be determined according to the content of the descriptor together with the data description position. In this way, part of the data (e.g., one or more data elements) in the tensor data indicated by the descriptor can be processed.
For example, if the content of the descriptor in the operand is expressed by formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x × size_y, and the data description position for the descriptor included in the operand is (xq, yq), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (7):
PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)        (7)
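A minimal Python sketch of the two address computations (the function names are ours; the symbols follow the equations above, with addresses counted in element units):

    def pa1(pa_start, ori_x, offset_x, offset_y):
        # Equation (6): start address of the tensor data indicated by the descriptor.
        return pa_start + (offset_y - 1) * ori_x + offset_x

    def pa2(pa_start, ori_x, offset_x, offset_y, xq, yq):
        # Equation (7): address of the element at data description position (xq, yq).
        return pa_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)

    # Illustrative values: ori_x = 10, block at offsets (3, 2); element (1, 1)
    # of the block sits one row (10 elements) and one column past its start.
    print(pa1(1000, 10, 3, 2))        # 1013
    print(pa2(1000, 10, 3, 2, 1, 1))  # 1024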
in one possible implementation, the descriptor may indicate the data of the block. The data partitioning can effectively accelerate the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data partitioning for fast arithmetic processing.
FIG. 7 shows a schematic diagram of data blocking in a data storage space according to an embodiment of the present disclosure. As shown in fig. 7, the data storage space 700 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown). Unlike the tensor data of fig. 6, the tensor data stored in fig. 7 includes multiple data blocks.
In this case, the descriptor requires more parameters to represent the data blocks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori_x; x.tile.size (the size 702 of a tile); x.tile.stride (the step size 704 of a tile, i.e., the distance between the first point of the first tile and the first point of the second tile); x.tile.num (the number of tiles, shown as 3 tiles in the figure); and x.stride (the overall step size, i.e., the distance from the first point of the first row to the first point of the second row). Other dimensions may similarly include corresponding parameters.
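By way of illustration (the loop and parameter spelling below are our assumptions; the text above only enumerates the parameters), the X coordinate of the first point of each tile in a row follows from x.tile.stride and x.tile.num:

    def tile_start_columns(x_tile_stride, x_tile_num, first_col=0):
        """X coordinates of the first point of each tile in one row of
        FIG. 7, relative to the first point of the first tile (first_col)."""
        return [first_col + t * x_tile_stride for t in range(x_tile_num)]

    # Three tiles whose first points are x.tile.stride = 4 columns apart.
    print(tile_start_columns(x_tile_stride=4, x_tile_num=3))  # [0, 4, 8]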
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish descriptors; for example, the identifier of a descriptor may be its number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension of the tensor data.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier, the content, and the tensor data indicated by the descriptor may be stored in the same block of internal memory; for example, a contiguous block of on-chip cache at addresses ADDR0-ADDR1023 may be used to store the relevant content of the descriptor. Within it, addresses ADDR0-ADDR63 can be used as a descriptor storage space to store the identifier and content of the descriptor, and addresses ADDR64-ADDR1023 can be used as a data storage space to store the tensor data indicated by the descriptor. In the descriptor storage space, the identifier of the descriptor may be stored at addresses ADDR0-ADDR31, and the content of the descriptor may be stored at addresses ADDR32-ADDR63. It should be understood that ADDR here does not denote one bit or one byte; it denotes one address, i.e., one addressing unit. The descriptor storage space, the data storage space, and their specific addresses may be determined by a person skilled in the art in practice, and the present disclosure does not limit this.
In one possible implementation, the identity of the descriptor, the content, and the tensor data indicated by the descriptor may be stored in different areas of the internal memory. For example, a register may be used as a descriptor storage space, the identifier and the content of the descriptor may be stored in the register, an on-chip cache may be used as a data storage space, and tensor data indicated by the descriptor may be stored.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area may be allocated in the buffer space for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identity and content of the descriptors may be stored in an internal memory and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identification and content of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored under-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be divided for each piece of tensor data, with the start address of each data storage space corresponding one-to-one to a descriptor. In this case, the circuit or module responsible for parsing the computation instruction (e.g., an entity outside the computing device of the present disclosure) can determine, from the descriptor, the data address in the data storage space of the data corresponding to the operand.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may be further used to indicate the address of the N-dimensional tensor data, and the content of the descriptor may further include at least one address parameter indicating the address of the tensor data. For example, if the tensor data is 3-dimensional, when the descriptor points to the address of the tensor data, the content of the descriptor may include one address parameter indicating the address of the tensor data, such as the start physical address of the tensor data, or it may include multiple address parameters of the address of the tensor data, such as the start address of the tensor data plus an address offset, or address parameters for each dimension of the tensor data. The address parameters can be set by a person skilled in the art according to actual needs, and the present disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include the reference address of the data reference point of the descriptor in the data storage space of the tensor data. The reference address may vary with the choice of data reference point. The present disclosure does not limit the selection of the data reference point.
In one possible implementation, the base address may comprise a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one direction of the N dimensional directions, the size of the storage area in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that the shape parameters representing the tensor data may be selected by one of ordinary skill in the art based on the actual situation, and are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
Based on the foregoing hardware environment, an embodiment of the present disclosure provides a data processing scheme that performs operations related to structured sparsity of tensor data according to specialized sparse instructions.
Fig. 8 shows a block diagram of a data processing apparatus 800 according to an embodiment of the present disclosure. The data processing apparatus 800 may be implemented, for example, in the computing apparatus 201 of fig. 2. As shown, the data processing apparatus 800 may include a control circuit 810, a tensor interface circuit 812, a storage circuit 820, and an arithmetic circuit 830.
The control circuit 810 may function similarly to the control module 31 of fig. 3 or the control module 51 of fig. 5, and may include, for example, an instruction fetch unit to fetch an instruction from, for example, the processing device 203 of fig. 2, and an instruction decode unit to decode the fetched instruction and send the decoded result as control information to the arithmetic circuit 830 and the storage circuit 820.
In one embodiment, the control circuitry 810 may be configured to parse a sparse instruction, wherein the sparse instruction indicates an operation related to structured sparsity and the at least one operand of the sparse instruction comprises at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data.
The tensor interface unit (TIU) 812 may be configured to implement operations associated with descriptors under the control of the control circuit 810. These operations may include, but are not limited to, registration, modification, deregistration, and parsing of descriptors, as well as reading and writing descriptor content. The present disclosure does not limit the specific hardware type of the tensor interface circuit. In this way, operations associated with descriptors can be realized by dedicated hardware, further improving the access efficiency of tensor data.
In some embodiments, tensor interface circuit 812 may be configured to parse shape information of tensor data included in an operand of an instruction to determine a data address in the data storage space of data corresponding to the operand.
Alternatively or additionally, in still other embodiments, tensor interface circuit 812 may be configured to compare spatial information (e.g., spatial IDs) of tensor data included in operands of two instructions to determine dependencies of the two instructions to determine out-of-order execution, synchronization, etc. operations of the instructions.
Although control circuit 810 and tensor interface circuit 812 are shown in fig. 8 as two separate blocks, those skilled in the art will appreciate that these two units may also be implemented as one block or more blocks, and the present disclosure is not limited in this respect.
Storage circuitry 820 may be configured to store pre-sparsification and/or post-sparsification information. In one embodiment, the operands of the sparse instructions are weights of a neural network. In this embodiment, the memory circuit may be, for example, the WRAM 332 of fig. 3 or the WRAM 532 of fig. 5.
The arithmetic circuitry 830 may be configured to perform corresponding operations according to the sparse instruction based on the parsed descriptors.
In some embodiments, the arithmetic circuit 830 may include one or more groups of pipelined arithmetic circuits 831, where each group of pipelined arithmetic circuits 831 may include one or more operators. When each group of pipelined arithmetic circuits includes a plurality of operators, the plurality of operators may be configured to perform multi-stage pipelined operations, i.e., to form a multi-stage operation pipeline.
In some application scenarios, the pipelined arithmetic circuits of the present disclosure may support operations related to structured sparsity. For example, when performing structured sparsification processing, a multi-stage pipelined arithmetic circuit composed of circuits such as comparators may be employed to perform the operation of extracting n data elements from every m data elements as valid data elements, where m > n. In one implementation, m is 4 and n is 2. In other implementations, n may take other values, such as 1 or 3.
In one embodiment, the arithmetic circuit 830 may further include an arithmetic processing circuit 832, which may be configured to pre-process data before the pipelined arithmetic circuit 831 performs operations, or to post-process data after the operations, according to the arithmetic instruction. In some application scenarios, the aforementioned pre-processing and post-processing may, for example, include data splitting and/or data splicing operations. In structured sparsification processing, the arithmetic processing circuit may split the data to be sparsified into segments of m data elements and then send them to the pipelined arithmetic circuit 831 for processing.
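The overall behaviour can be sketched in Python as follows (illustrative only: the function name and list interface are our assumptions, and this disclosure realizes the selection in pipelined hardware rather than software):

    def structured_sparsify(data, m=4, n=2):
        """Split data into segments of m elements and keep, from each
        segment, the n elements with the largest absolute values."""
        kept_values, kept_indices = [], []
        for base in range(0, len(data), m):
            segment = data[base:base + m]
            # Indices of the n elements with the largest absolute value.
            top = sorted(range(len(segment)),
                         key=lambda i: abs(segment[i]), reverse=True)[:n]
            for i in sorted(top):
                kept_values.append(segment[i])
                kept_indices.append(base + i)
        return kept_values, kept_indices

    print(structured_sparsify([1, -5, 3, 2, 0, 7, -6, 4]))
    # ([-5, 3, 7, -6], [1, 2, 5, 6])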
FIG. 9A illustrates an exemplary operation pipeline for structured sparsification according to one embodiment of the present disclosure. The embodiment of fig. 9A shows, for m = 4 and n = 2, the structured sparsification that screens out the 2 data elements with the larger absolute values from the 4 data elements A, B, C, and D.
As shown in fig. 9A, this structured sparsification can be performed by a 4-stage pipelined arithmetic circuit comprising absolute value operators and comparators.
The first stage pipelined arithmetic circuit may include 4 absolute value operators 910 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively.
The second-stage pipelined arithmetic circuit may comprise two comparators for a grouped comparison of the 4 absolute values output by the previous stage. For example, the first comparator 921 may compare the absolute values of data elements A and B and output the larger value Max00, and the second comparator 922 may compare the absolute values of data elements C and D and output the larger value Max10.
The third-stage pipelined arithmetic circuit may include a third comparator 930 for comparing the 2 larger values Max00 and Max10 output by the previous stage and outputting the larger value Max0. This value Max0 has the largest absolute value among the 4 data elements.
The fourth-stage pipelined arithmetic circuit may include a fourth comparator 940, which compares the smaller value Min0 from the previous stage with the other value in the group from which the maximum Max0 came, and outputs the larger value Max1. This value Max1 has the second-largest absolute value among the 4 data elements.
The two-out-of-four structured sparsification can thus be realized by this 4-stage pipelined arithmetic circuit.
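As a sanity check of the dataflow just described, here is a hedged software model of the four stages; the variable names follow the figure, and the group-mate selection in stage 4 is an assumption consistent with the text above.

def four_stage_pipeline(A, B, C, D):
    # Stage 1: absolute values
    a, b, c, d = abs(A), abs(B), abs(C), abs(D)
    # Stage 2: grouped comparison of (a, b) and (c, d)
    max00, max10 = max(a, b), max(c, d)
    # Stage 3: the overall maximum of the four absolute values
    if max00 >= max10:
        max0, min0 = max00, max10     # Max0 came from group (a, b)
        sibling = min(a, b)           # the other value in Max0's group
    else:
        max0, min0 = max10, max00     # Max0 came from group (c, d)
        sibling = min(c, d)
    # Stage 4: second largest = larger of Min0 and Max0's group-mate
    max1 = max(min0, sibling)
    return max0, max1

print(four_stage_pipeline(1, -5, 3, 2))   # (5, 3)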
FIG. 9B illustrates an exemplary operation pipeline for structured sparsification according to another embodiment of the present disclosure. Likewise, the embodiment of fig. 9B shows, for m = 4 and n = 2, the structured sparsification that screens out the 2 data elements with the larger absolute values from the 4 data elements A, B, C, and D.
As shown in fig. 9B, this structured sparsification can be performed by a multi-stage pipelined arithmetic circuit composed of absolute value operators, comparators, and the like.
The first pipeline stage may include m (here, 4) absolute value operators 950 for synchronously taking the absolute values of the 4 input data elements A, B, C, and D, respectively. To facilitate the final output of valid data elements, in some embodiments the first pipeline stage may output both the original data elements (i.e., A, B, C, and D) and the results of the absolute value operation (i.e., |A|, |B|, |C|, and |D|).
The second pipeline stage may include a permutation circuit 960 for permuting the m absolute values to generate m groups of data, where each group contains all m absolute values and the positions of the m absolute values differ from group to group.
In some embodiments, the permutation circuit may be a circular shifter that rotates the sequence of m absolute values (e.g., |A|, |B|, |C|, and |D|) m-1 times, which together with the original order yields m groups of data. For example, in the illustrated case, 4 groups of data are generated: {|A|, |B|, |C|, |D|}, {|B|, |C|, |D|, |A|}, {|C|, |D|, |A|, |B|}, and {|D|, |A|, |B|, |C|}. As before, each group of data is output together with its corresponding original data element, each group corresponding to one original data element.
The third pipeline stage includes a comparison circuit 970 for comparing absolute values in the m sets of data and generating a comparison result.
In some embodiments, the third pipeline stage may include m comparison circuits, each comprising m-1 comparators (971, 972, 973), where the m-1 comparators in the ith comparison circuit sequentially compare one absolute value in the ith group of data with the other three absolute values and generate comparison results, with 1 ≤ i ≤ m.
The third pipeline stage may thus also be viewed as m-1 (here, 3) sub-pipeline stages. Each sub-pipeline stage comprises m comparators, each comparing its corresponding absolute value with one of the other absolute values; across the m-1 sub-pipeline stages, each absolute value is compared in turn with the other m-1 absolute values.
For example, in the illustrated case, the 4 comparators 971 in the first sub-pipeline stage compare the first absolute value with the second absolute value in each of the 4 groups of data and output comparison results w0, x0, y0, and z0, respectively. The 4 comparators 972 in the second sub-pipeline stage compare the first absolute value with the third absolute value in each group and output comparison results w1, x1, y1, and z1, respectively. The 4 comparators 973 in the third sub-pipeline stage compare the first absolute value with the fourth absolute value in each group and output comparison results w2, x2, y2, and z2, respectively.
Thus, a comparison of each absolute value with the other m-1 absolute values can be obtained.
In some embodiments, the comparison results may be represented as a bitmap. For example, at the 1st comparator of the 1st comparison circuit, when |A| ≥ |B|, w0 = 1; at the 2nd comparator of the 1st circuit, when |A| < |C|, w1 = 0; at the 3rd comparator of the 1st circuit, when |A| ≥ |D|, w2 = 1. The output of the 1st comparison circuit is thus {A, w0, w1, w2}, i.e., {A, 1, 0, 1}. Similarly, the output of the 2nd comparison circuit is {B, x0, x1, x2}, the output of the 3rd comparison circuit is {C, y0, y1, y2}, and the output of the 4th comparison circuit is {D, z0, z1, z2}.
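Under the assumption that every comparator uses ≥ (consistent with the w0 and w2 examples above), the bitmap generation of the second and third stages can be sketched as follows; the function name is assumed.

def compare_bitmaps(values):
    """For each cyclic rotation of the absolute values, compare the
    leading value against the other m-1 values (1 when >=)."""
    m = len(values)
    mags = [abs(v) for v in values]
    results = []
    for i in range(m):                    # the i-th comparison circuit
        rot = mags[i:] + mags[:i]         # the i-th rotated group
        bits = [1 if rot[0] >= rot[k] else 0 for k in range(1, m)]
        results.append((values[i], bits))
    return results

print(compare_bitmaps([3, -1, 5, 3]))
# [(3, [1, 0, 1]), (-1, [0, 0, 0]), (5, [1, 1, 1]), (3, [1, 1, 0])]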
The fourth pipeline stage includes a screening circuit 980 for selecting, according to the comparison results of the third stage, the n data elements with the larger absolute values from the m data elements as valid data elements, and outputting the valid data elements together with corresponding indexes. The indexes indicate the positions of the valid data elements among the m input data elements. For example, when A and C are screened out of the four data elements A, B, C, and D, their corresponding indexes may be 0 and 2.
Based on the comparison results, appropriate logic can be designed to select the n data elements with the larger absolute values. Since multiple data elements may share the same absolute value, in a further embodiment the selection follows a specified priority order when equal absolute values occur. For example, priority may be fixed in order of increasing index, with A having the highest priority and D the lowest. In one example, when the absolute values of A, C, and D are all equal and greater than the absolute value of B, the selected data elements are A and C.
From the foregoing comparison results, w0, w1, and w2 reveal how many of {|B|, |C|, |D|} are exceeded by |A|. If w0, w1, and w2 are all 1, then |A| is larger than |B|, |C|, and |D|, i.e., the maximum of the four, so A is selected. If exactly two of w0, w1, and w2 are 1, then |A| is the second largest of the four absolute values, so A is also selected. Otherwise, A is not selected. Thus, in some embodiments, the decision can be made by counting the 1s among these comparison results.
In one implementation, the valid data elements may be selected based on the following logic. First, count how many of the other data elements each element is greater than: define N_A = sum_w = w0 + w1 + w2, N_B = sum_x = x0 + x1 + x2, N_C = sum_y = y0 + y1 + y2, and N_D = sum_z = z0 + z1 + z2. The selection is then made under the following conditions (see the sketch after this list).
The condition for selecting A is: N_A = 3, or N_A = 2 and exactly one of N_B/N_C/N_D equals 3.
The condition for selecting B is: N_B = 3, or N_B = 2 and exactly one of N_A/N_C/N_D equals 3 and N_A ≠ 2.
The condition for selecting C is: N_C = 3 and at most one of N_A/N_B equals 3, or N_C = 2 and exactly one of N_A/N_B/N_D equals 3 and neither N_A nor N_B equals 2.
The condition for selecting D is: N_D = 3 and at most one of N_A/N_B/N_C equals 3, or N_D = 2 and exactly one of N_A/N_B/N_C equals 3 and none of N_A/N_B/N_C equals 2.
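For concreteness, here is a hedged sketch of these conditions, reusing the ≥-based win counts; the function names are assumed.

def wins(values):
    """N_A..N_D: how many of the other absolute values each one >=."""
    mags = [abs(v) for v in values]
    counts = []
    for i in range(4):
        rot = mags[i:] + mags[:i]
        counts.append(sum(1 for k in range(1, 4) if rot[0] >= rot[k]))
    return counts

def screen_two_of_four(values):
    na, nb, nc, nd = wins(values)
    sel = [
        na == 3 or (na == 2 and [nb, nc, nd].count(3) == 1),
        nb == 3 or (nb == 2 and [na, nc, nd].count(3) == 1 and na != 2),
        (nc == 3 and [na, nb].count(3) <= 1)
        or (nc == 2 and [na, nb, nd].count(3) == 1 and 2 not in (na, nb)),
        (nd == 3 and [na, nb, nc].count(3) <= 1)
        or (nd == 2 and [na, nb, nc].count(3) == 1
            and 2 not in (na, nb, nc)),
    ]
    return [(values[i], i) for i in range(4) if sel[i]]

print(screen_two_of_four([3, -1, 5, 3]))   # [(3, 0), (5, 2)]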
Those skilled in the art will appreciate that the above logic contains some redundancy in order to guarantee selection under the predetermined priority. Based on the magnitude and order information provided by the comparison results, other logic may be devised to screen the valid data elements, and the present disclosure is not limited in this respect. The multi-stage pipelined arithmetic circuit of fig. 9B can thus also realize two-out-of-four structured sparsification.
Those skilled in the art will appreciate that pipelined arithmetic circuits of other forms may also be designed to implement structured sparsity, and the present disclosure is not limited in this respect.
As mentioned previously, the operands of the sparse instruction may be data in a neural network, such as weights and neurons. Data in neural networks typically have multiple dimensions. For example, in a convolutional neural network, data may have four dimensions: input channel, output channel, length, and width. In some embodiments, the sparse instruction may be used for structured sparsification of at least one dimension of multidimensional data in a neural network. In particular, in one implementation, the sparse instruction may be used for structured sparsification of the input channel dimension of multidimensional data, for example in the inference process or forward training process of the neural network. In another implementation, the sparse instruction may be used to apply structured sparsity simultaneously to the input channel and output channel dimensions, for example during the reverse training of the neural network.
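As an illustration only, and assuming NumPy and a (Cout, Cin, H, W) weight layout that the disclosure does not mandate, 2-of-4 sparsity along the input channel dimension could look like this sketch (function name assumed):

import numpy as np

def sparsify_input_channels(w, m=4, n=2):
    """Keep the n largest-|value| entries in every group of m
    consecutive input channels, independently per output position."""
    cout, cin, h, wd = w.shape
    assert cin % m == 0
    v = w.reshape(cout, cin // m, m, h, wd)
    # Stable descending sort by magnitude; ties keep the lower index.
    order = np.argsort(-np.abs(v), axis=2, kind="stable")
    mask = np.zeros(v.shape, dtype=bool)
    np.put_along_axis(mask, order[:, :, :n], True, axis=2)
    return (v * mask).reshape(w.shape), mask.reshape(w.shape)

w = np.random.randn(8, 8, 3, 3).astype(np.float32)
sparse_w, mask = sparsify_input_channels(w)
# Exactly n of every m input channels survive at each position:
assert mask.reshape(8, 2, 4, 3, 3).sum(axis=2).max() == 2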
In one embodiment, in response to receiving a plurality of sparse instructions, one or more multi-stage pipelined arithmetic circuits of the present disclosure may be configured to perform multiple data operations, for example executing single instruction multiple data ("SIMD") instructions. In another embodiment, the operations performed by the arithmetic circuits of each stage are predetermined according to the functions supported by the arithmetic circuits arranged stage by stage in the multi-stage operation pipeline.
In the context of the present disclosure, the aforementioned sparse instructions may be microinstructions or control signals executed within one or more multi-stage operation pipelines, and may include (or indicate) one or more operations to be performed by those pipelines. Depending on the scenario, these operations may include, but are not limited to, arithmetic operations such as convolution and matrix multiplication, logical operations such as AND, XOR, and OR, shift operations, or any combination of the foregoing.
FIG. 10 illustrates an exemplary flow diagram of a data processing method 1000 in accordance with an embodiment of the disclosure.
As shown in fig. 10, in step 1010, a sparse instruction is parsed, the sparse instruction indicating an operation related to structured sparsity, and at least one operand of the sparse instruction includes at least one descriptor indicating at least one of the following information: shape information of the tensor data and spatial information of the tensor data. This step may be performed, for example, by control circuitry 810 of fig. 8.
Next, in step 1020, the descriptor is parsed. This step may be performed, for example, by the tensor interface circuit 812 of fig. 8. Specifically, the data address in the data storage space of the tensor data corresponding to the operand may be determined according to the shape information of the tensor data; and/or dependencies between instructions may be determined based on the spatial information of the tensor data.
Next, in step 1030, the corresponding operand is read based at least in part on the parsed descriptor. When the operand is tensor data, the data address obtained from the parsed descriptor allows the corresponding data to be read. The sparse instruction may indicate different operation modes, with correspondingly different operands, as described in detail below. This step may be performed, for example, by the control circuitry 810 of fig. 8 accessing the storage circuitry 820.
Next, in step 1040, the operation related to structured sparsity is performed on the operands that were read. This step may be performed, for example, by the arithmetic circuit 830 of fig. 8.
Finally, in step 1050, the operation result is output. For example, the operation result may be output by the arithmetic circuit 830 to the storage circuit 820 for subsequent use.
Operations related to structured sparsity may take various forms, such as structured sparsification, anti-sparsification, and the like. Various instruction schemes may be devised to implement these operations.
In one scheme, a sparse instruction may be designed, and an operation mode bit may be included in the instruction to indicate different operation modes of the sparse instruction, so as to perform different operations.
In another scheme, a plurality of sparse instructions may be designed, each instruction corresponding to one or more different operation modes, so as to execute different operations. In one implementation, a corresponding sparse instruction may be designed for each mode of operation. In another implementation, the operation modes can be classified according to their characteristics, and a sparse instruction is designed for each type of operation mode. Further, when multiple operating modes are included in a certain class of operating modes, an operating mode bit may be included in the sparse instruction to indicate the respective operating mode.
Regardless of the scheme, the sparse instruction may indicate its corresponding mode of operation via an operating mode bit and/or the instruction itself.
In one embodiment, the sparse instruction may indicate a first operation mode. In the first operation mode, the operand of the sparse instruction comprises the data to be sparsified. In this case, the arithmetic circuit 830 may be configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and to output the sparsified structure to the storage circuit 820.
The structured sparsification in the first operation mode may follow a predetermined filtering rule; for example, according to a rule that favors larger absolute values, the n data elements with the larger absolute values are screened out of every m data elements as valid data elements. The arithmetic circuit 830 may be configured, for example, as the pipelined arithmetic circuits described with reference to figs. 9A and 9B to perform this structured sparsification.
The result of the sparsification comprises two parts: a data portion and an index portion. The data portion contains the sparsified data, i.e., the valid data elements extracted according to the filtering rule of the structured sparsification. The index portion indicates the positions of the sparsified data, i.e., the valid data elements, within the pre-sparsification data (i.e., the data to be sparsified).
The structure in the embodiments of the present disclosure comprises a data portion and an index portion that are bound to each other. In some embodiments, each 1 bit in the index portion may correspond to one data element. For example, when the data type is fix8, one data element is 8 bits, and each 1 bit in the index portion corresponds to 8 bits of data. In other embodiments, each 1 bit in the index portion of the structure may be set to correspond to the position of N bits of data, with N determined at least in part by the hardware configuration, to facilitate hardware-level implementation when the structure is subsequently used. For example, each 1 bit of the index portion in the structure may correspond to the position of 4 bits of data; then, when the data type is fix8, every 2 bits in the index portion correspond to one data element of the fix8 type. In some embodiments, the data portion of the structure may be aligned according to a first alignment requirement and the index portion according to a second alignment requirement, so that the entire structure also satisfies an alignment requirement. For example, the data portion may be aligned to 64B and the index portion to 32B, so that the whole structure is aligned to 96B (64B + 32B). These alignment requirements reduce the number of memory accesses during subsequent use and improve processing efficiency.
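The size bookkeeping implied by these figures can be sketched as follows; the function names, and the assumption that the index covers the pre-sparsification positions at 2 bits per fix8 element, follow the examples above and are illustrative only.

def aligned(size, boundary):
    """Round `size` up to the next multiple of `boundary` bytes."""
    return -(-size // boundary) * boundary

def structure_size(num_valid, elem_bits, m=4, n=2,
                   index_bits_per_elem=2,
                   data_align=64, index_align=32):
    num_orig = num_valid * m // n          # elements before sparsifying
    data_bytes = aligned(num_valid * elem_bits // 8, data_align)
    index_bytes = aligned(num_orig * index_bits_per_elem // 8,
                          index_align)
    return data_bytes + index_bytes

# 64 valid fix8 elements (128 before sparsifying):
# 64 B of data + 32 B of index = 96 B in total
print(structure_size(64, 8))   # 96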
By using such a structure, the data portion and the index portion can be used together. Since the proportion of valid data elements among all data elements in structured sparsification is fixed, for example n/m, the data size after sparsification is also fixed and predictable. The structure can therefore be densely stored in the storage circuit without performance loss.
In another embodiment, the sparse instruction may indicate a second operation mode. The second operation mode differs from the first in its output: it outputs only the sparsified data portion, without the index portion.
Similarly, in the second operation mode, the operand of the sparse instruction comprises the data to be sparsified. In this case, the arithmetic circuit 830 may be configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and to output the sparsified data portion to the storage circuit 820. The data portion contains the sparsified data and is densely stored in the storage circuit, with the output data portion aligned by n elements. For example, with m = 4 and n = 2, the input data to be sparsified is aligned by 4 elements and the output data portion by 2 elements.
In yet another embodiment, the sparse instruction may indicate a third operation mode. The third operation mode differs from the first in its output: it outputs only the index portion produced by the structured sparsification, without the data portion.
Similarly, in the third operation mode, the operand of the sparse instruction comprises the data to be sparsified. In this case, the arithmetic circuit 830 may be configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and to output the resulting index portion to the storage circuit 820. The index portion indicates the original positions of the sparsified data within the data to be sparsified, and is densely stored in the storage circuit, with each 1 bit of the output index portion corresponding to the position of one data element. Since the index portion may be used on its own, for example for structured sparsification of neurons in subsequent convolution processing, while the data type of the neurons may be undetermined, mapping each 1 bit of the index portion to the position of one data element allows the separately stored index portion to suit various data types.
In yet another embodiment, the sparse instruction may indicate a fourth operation mode. The fourth operation mode differs from the first in that it specifies the filtering rule of the structured sparsification, instead of applying a predetermined filtering rule (for example, the larger-absolute-value rule described above). In this mode, the sparse instruction has two operands: the data to be sparsified and a sparse index. The added sparse index operand indicates the positions of the valid data elements in the structured sparsification to be performed, i.e., it specifies the filtering rule. Each 1 bit in the sparse index corresponds to the position of one data element, so the sparse index suits data to be sparsified of various data types.
In the fourth operation mode, the arithmetic circuit 830 may be configured to perform structured sparsification on the data to be sparsified at the positions indicated by the sparse index according to the sparse instruction, and to output the result to the storage circuit. In one implementation, the output may be the sparsified structure; in another implementation, it may be the sparsified data portion only.
The structure has the same meaning as in the first operation mode: it comprises a data portion and an index portion bound to each other, where the data portion contains the sparsified data and the index portion indicates the original positions of the sparsified data within the data to be sparsified. The alignment requirements, bit correspondences, and so on for the data portion and index portion of the structure are the same as in the first operation mode and are not repeated here.
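A hedged sketch of this fourth mode, assuming a Python list model with one index bit per element as stated above (the function name is assumed):

def sparsify_by_index(data, index_bits, m=4, n=2):
    """Keep exactly the elements whose index bit is 1; the caller's
    sparse index encodes the filtering rule (n ones per group of m)."""
    assert len(data) == len(index_bits) and len(data) % m == 0
    data_part = [v for v, bit in zip(data, index_bits) if bit]
    assert len(data_part) == len(data) // m * n
    return data_part, index_bits        # a "structure": data + index

data  = [1.5, -2.0, 0.25, 3.0]
index = [0, 1, 0, 1]                    # keep B and D
print(sparsify_by_index(data, index))   # ([-2.0, 3.0], [0, 1, 0, 1])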
The above four operation modes provide structured sparsification of data, either under a predetermined filtering rule or under a rule specified by an instruction operand, with different output contents: the structure, only the data portion, or only the index portion. This instruction design supports structured sparsification well and offers output options for different scenarios; for example, when the data and index must be used bound together, the structure can be output, and when the index portion or data portion is needed on its own, only that portion can be output.
In yet another embodiment, the sparse instruction may indicate a fifth operation mode. The fifth operation mode performs no structured sparsification; it merely binds a separate, independent data portion and index portion into a structure.
In the fifth operation mode, the operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion. The data portion and the index portion are each in a dense storage format but are not bound. The input data portion is aligned by n elements; for example, with m = 4 and n = 2, it is aligned by 2 elements. The index portion indicates the original positions of the data portion within the pre-sparsification data, with each 1 bit of the index portion corresponding to one data element.
In this case, the arithmetic circuit 830 may be configured to bind the data portion and the index portion into a structure according to the sparse instruction, and to output the structure to the storage circuit. The meaning of the structure, and the alignment requirements and bit correspondences of its data and index portions, are the same as in the first operation mode and are not repeated here. Depending on the data type of the data elements, the index portion inside the structure must be regenerated according to the data type and the structure's index bit correspondence. For example, if the input index portion is 0011, where each 1 bit corresponds to one data element, and the data type is fix8 (i.e., each data element has 8 bits), then under the correspondence of 1 index bit per 4 bits of data, the index portion in the structure should be 00001111, i.e., 2 bits per data element.
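This bit-width conversion can be sketched as follows; the function name is assumed, and the 1-bit-per-4-bits correspondence follows the example above.

def expand_index(index_bits, elem_bits, bits_per_index_bit=4):
    """Re-encode a 1-bit-per-element index for a structure in which
    each index bit covers `bits_per_index_bit` bits of data."""
    repeat = elem_bits // bits_per_index_bit   # index bits per element
    return [b for b in index_bits for _ in range(repeat)]

print(expand_index([0, 0, 1, 1], elem_bits=8))
# [0, 0, 0, 0, 1, 1, 1, 1]  -- i.e., 0011 becomes 00001111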
In yet another embodiment, the sparse instruction may indicate a sixth operation mode. The sixth operation mode performs anti-sparsification, i.e., it restores sparsified data to the data format or scale it had before sparsification.
In the sixth operation mode, the operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, each in a dense storage format but not bound. The input data portion is aligned by n elements; for example, with m = 4 and n = 2, the input data portion is aligned by 2 elements and the output data by 4 elements. The index portion indicates the original positions of the data portion within the pre-sparsification data, with each 1 bit of the index portion corresponding to one data element.
In this case, the arithmetic circuit 830 may be configured, according to the sparse instruction, to perform anti-sparsification on the input data portion at the positions indicated by the input index portion, generating restored data in the pre-sparsification data format, and to output the restored data to the storage circuit.
In one implementation, the anti-sparsification may include: according to the positions indicated by the index portion and the pre-sparsification data format, placing each data element of the data portion at its corresponding position in that format, and filling the remaining positions with predetermined information (for example, 0) to generate the restored data.
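A minimal sketch of this scatter-and-fill step, continuing the example from the fourth mode (the function name is assumed):

def desparsify(data_part, index_bits, fill=0):
    """Scatter valid elements back to the positions whose index bit
    is 1; fill every other position with `fill` (e.g., 0)."""
    it = iter(data_part)
    return [next(it) if bit else fill for bit in index_bits]

print(desparsify([-2.0, 3.0], [0, 1, 0, 1]))
# [0, -2.0, 0, 3.0]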
From the foregoing description, it can be seen that the disclosed embodiments provide sparse instructions for performing operations related to structured sparsity. These operations may include forward structured sparsification, anti-sparsification, and some associated format conversions. In some embodiments, an operation mode bit may be included in the sparse instruction to indicate different operation modes and thereby perform different operations. In other embodiments, multiple sparse instructions may be provided directly, each corresponding to one or more operation modes, to perform the various operations related to structured sparsity. Providing specialized sparse instructions for these operations simplifies processing and thereby improves the processing efficiency of the machine.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet-of-things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. Vehicles include aircraft, ships, and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to fields such as the internet, the internet of things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are compatible with each other, so that, based on the hardware information of the terminal and/or edge device, appropriate hardware resources of the cloud device can be matched to simulate the hardware resources of that device, thereby achieving unified management, scheduling, and collaborative work of end-cloud integration or cloud-edge-end integration.
It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, those of ordinary skill in the art will appreciate that certain steps may be performed in other orders or simultaneously, in accordance with the disclosure or teachings herein. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the acts or modules involved are not necessarily required for implementing one or another solution of the disclosure. In addition, depending on the solution, the descriptions of different embodiments may have different emphases. In view of the above, those skilled in the art will understand that for portions not described in detail in one embodiment of the disclosure, reference may be made to the descriptions of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage circuit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
clause 1, a data processing apparatus, comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
a tensor interface circuit configured to parse the descriptors;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
Clause 2, the data processing apparatus according to clause 1, wherein,
the tensor interface circuit is configured to determine a data address of tensor data corresponding to the operand in a data storage space according to the shape information; and/or
The tensor interface circuit is configured to determine a dependency relationship between instructions according to the spatial information.
Clause 3, the data processing apparatus according to any one of clauses 1 to 2, wherein the shape information of the tensor data includes at least one shape parameter indicating a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data includes at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
Clause 4, the data processing apparatus according to any of clauses 1-2, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data including a plurality of data blocks, N being a positive integer, the shape parameter including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
Clause 5, the data processing apparatus of any of clauses 1-4, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction includes data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output a thinned structure to the storage circuit, where the structure includes a data portion and an index portion, the data portion includes the thinned data of the data to be thinned, and the index portion is used to indicate the position of the thinned data in the data to be thinned.
Clause 6, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to execute structured sparse processing on the data to be thinned according to the sparse instruction, and output a thinned data portion to the storage circuit, where the data portion includes the data of the data to be thinned after the thinning processing.
Clause 7, the data processing apparatus of any of clauses 1-4, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output an index part subjected to sparse processing to the storage circuit, wherein the index part indicates the position of the data subjected to sparse processing in the data to be thinned.
Clause 8, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction include data to be thinned out and a sparse index indicating the positions of valid data elements in the structured sparsification to be performed,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction and the position indicated by the sparse index, and output a thinned structure or a thinned data part to the storage circuit, where the structure includes a data portion and an index portion that are bound to each other, the data portion includes the thinned data of the data to be thinned, and the index portion is used to indicate the position of the thinned data in the data to be thinned.
Clause 9, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsification,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
Clause 10, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction include a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning-out,
the arithmetic circuit is configured to perform, according to the sparse instruction and according to the position indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data in a data format before sparsification processing, and output the recovered data to the storage circuit.
Clause 11, the data processing apparatus according to any of clauses 5-8, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
Clause 12, the data processing apparatus of clause 11, wherein the arithmetic circuitry further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 13, the data processing apparatus of clause 12, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value operators for respectively taking the absolute values of m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, which is used for permutation and combination of the m absolute values to generate m groups of data, wherein each group of data comprises the m absolute values and the positions of the m absolute values in each group of data are different from each other;
the third pipeline stage comprises m comparison circuits for comparing absolute values in the m groups of data and generating comparison results; and
the fourth pipeline stage comprises a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements in the m data elements.
Clause 14, the data processing apparatus according to clause 13, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, the m-1 comparators in the ith comparison circuit being configured to sequentially compare one absolute value in the ith group of data with the other three absolute values and generate comparison results, where 1 ≤ i ≤ m.
Clause 15, the data processing apparatus according to any one of clauses 13-14, wherein the filtering circuitry is further configured to select according to a specified priority order when there are data elements that are the same in absolute value.
Clause 16, the data processing apparatus of clause 10, wherein the anti-sparsification process comprises:
according to the position indicated by the index part, according to the data format before sparsifying, each data element in the data part is respectively placed at the corresponding position of the data format before sparsifying, and the rest positions of the data format are filled with predetermined information to generate the recovery data.
Clause 17, the data processing apparatus according to clause 5, 8 or 9, wherein
Each 1 bit in the index part in the structure body corresponds to the position of N bits of data, and N is determined at least partially based on hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
Clause 18, the data processing apparatus of any of clauses 1-17, wherein the sparseness instruction is for structured sparseness processing of at least one dimension of the multidimensional data in the neural network.
Clause 19, the data processing apparatus of clause 18, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
Clause 20, the data processing apparatus according to any of clauses 1-19, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
Clause 21, a chip comprising the data processing apparatus of any of clauses 1-20.
Clause 22, a board comprising the chip of clause 21.
Clause 23, a data processing method, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
parsing the descriptor;
reading a corresponding operand based at least in part on the parsed descriptor;
performing the structured sparsity-related operation on the operands; and
and outputting the operation result.
Clause 24, the data processing method of clause 23, wherein parsing the descriptor comprises:
according to the shape information, determining the data address of tensor data corresponding to the operand in a data storage space; and/or
And determining the dependency relationship between the instructions according to the spatial information.
Clause 25, the data processing method according to any of clauses 23-24, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data includes at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
Clause 26, the data processing method according to any of clauses 23-24, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
Clause 27, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure, wherein the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion is used to indicate the position of the sparsified data in the data to be sparsified.
Clause 28, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
and outputting a sparsified data portion, wherein the data portion comprises the sparsified data of the data to be sparsified.
Clause 29, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a thinned index part, wherein the index part indicates the position of the thinned data in the data to be thinned.
Clause 30, the data processing method according to any of clauses 23-26, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction include data to be thinned out and a sparse index indicating the positions of valid data elements in the structured sparsification to be performed, the method further comprising:
according to the sparse instruction and the position indicated by the sparse index, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure or a sparsified data portion, wherein the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion is used to indicate the position of the sparsified data in the data to be sparsified.
Clause 31, the data processing method according to any of clauses 23-26, wherein the sparse instruction indicates a fifth mode of operation, and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the data prior to the sparsification, the method further comprising:
binding the data part and the index part into a structure according to the sparse instruction; and
and outputting the structural body.
Clause 32, the data processing method according to any of clauses 23-26, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the data prior to the sparsification, the method further comprising:
according to the sparse instruction, according to the position indicated by the index part, performing anti-sparsification processing on the data part to generate recovery data with a data format before sparsification processing; and
and outputting the recovered data.
Clause 33, the data processing method according to any one of clauses 27-30, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
Clause 34, the data processing method of clause 33, wherein the structured sparsification is implemented using an arithmetic circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 35, the data processing method of clause 34, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value operators for respectively taking the absolute values of m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, which is used for permutation and combination of the m absolute values to generate m groups of data, wherein each group of data comprises the m absolute values and the positions of the m absolute values in each group of data are different from each other;
the third pipeline stage comprises m comparison circuits for comparing absolute values in the m groups of data and generating comparison results; and
the fourth pipeline stage comprises a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements in the m data elements.
Clause 36, the data processing method according to clause 35, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, the m-1 comparators in the ith comparison circuit being configured to sequentially compare one absolute value in the ith group of data with the other three absolute values and generate comparison results, where 1 ≤ i ≤ m.
Clause 37, the data processing method according to any of clauses 35-36, wherein the filtering circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.
Clause 38, the data processing method of clause 32, wherein the anti-sparsification process comprises:
according to the position indicated by the index part, according to the data format before sparsifying, each data element in the data part is respectively placed at the corresponding position of the data format before sparsifying, and the rest positions of the data format are filled with predetermined information to generate the recovery data.
Clause 39, the data processing method of clause 27, 30 or 31, wherein:
each 1 bit in the index part in the structure body corresponds to the position of N bits of data, and N is determined at least partially based on hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
Clause 40, the data processing method of any of clauses 23-39, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
Clause 41, the data processing method of clause 40, wherein the at least one dimension is selected from the group consisting of an input channel dimension and an output channel dimension.
Clause 42, the data processing method according to any of clauses 23-41, wherein
The sparse instruction includes an operation mode bit to indicate an operation mode of the sparse instruction, or the sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operation modes.
The foregoing detailed description of the disclosed embodiments has been presented to illustrate the principles and implementations of the present disclosure and to help those skilled in the art understand its methods and core ideas. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (42)

1. A data processing apparatus comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
a tensor interface circuit configured to parse the descriptors;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
2. The data processing apparatus according to claim 1,
the tensor interface circuit is configured to determine a data address of tensor data corresponding to the operand in a data storage space according to the shape information; and/or
The tensor interface circuit is configured to determine a dependency relationship between instructions according to the spatial information.
3. The data processing apparatus according to any one of claims 1 to 2, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
4. The data processing apparatus of any of claims 1-2, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter comprising at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
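As a hedged sketch of how the shape parameters of claims 3-4 can map a description position to a data address, the Python below assumes a row-major layout and uses only three of the listed parameters (storage-space size, region size, per-dimension offset); the class and field names are illustrative only.

    from dataclasses import dataclass

    @dataclass
    class Descriptor:
        space: tuple    # size of the data storage space in each dimension
        region: tuple   # size of the tensor's storage region in each dimension
        offset: tuple   # offset of the region in each dimension

        def address(self, pos):
            """Flat row-major address of description position `pos`."""
            assert all(p < r for p, r in zip(pos, self.region)), "outside region"
            addr = 0
            for dim_size, off, p in zip(self.space, self.offset, pos):
                addr = addr * dim_size + off + p
            return addr

    d = Descriptor(space=(8, 16), region=(4, 4), offset=(2, 3))
    print(d.address((1, 2)))  # row 3, column 5 of an 8x16 space -> 53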
5. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a first mode of operation and operands of the sparse instruction comprise data to be sparsified,
the arithmetic circuit is configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and output a sparsified structure to the storage circuit, where the structure includes a data portion and an index portion, the data portion includes the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
6. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a second mode of operation and operands of the sparse instruction comprise data to be sparsified,
the arithmetic circuit is configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and output a sparsified data portion to the storage circuit, where the data portion includes the sparsified data of the data to be sparsified.
7. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a third mode of operation and operands of the sparse instruction comprise data to be sparsified,
the arithmetic circuit is configured to perform structured sparsification on the data to be sparsified according to the sparse instruction, and output a sparsified index portion to the storage circuit, where the index portion indicates the position of the sparsified data in the data to be sparsified.
8. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction comprise data to be sparsified and a sparse index indicating the positions of valid data elements in the structured sparsification to be performed,
the arithmetic circuit is configured to perform structured sparsification on the data to be sparsified according to the sparse instruction and the positions indicated by the sparse index, and output a sparsified structure or a sparsified data portion to the storage circuit, where the structure includes a data portion and an index portion bound to each other, the data portion includes the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
9. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a fifth mode of operation, and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the pre-sparsification data,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
10. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a sixth mode of operation, and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the pre-sparsification data,
the arithmetic circuit is configured to perform, according to the sparse instruction and the positions indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data having the pre-sparsification data format, and output the recovered data to the storage circuit.
11. The data processing apparatus according to any of claims 5-8, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, where m > n.
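A runnable Python illustration of claim 11's n-of-m rule, shown with the common m = 4, n = 2 pattern; the bitmask index format and the keep-largest-absolute-value criterion (which claim 12 introduces) are used here only for concreteness.

    def structured_sparsify(data, m=4, n=2):
        """From every group of m elements keep the n with the largest absolute
        values; return the kept elements plus one position bitmask per group."""
        parts, indexes = [], []
        for g in range(0, len(data), m):
            group = data[g:g + m]
            keep = sorted(range(len(group)),
                          key=lambda i: abs(group[i]), reverse=True)[:n]
            keep.sort()                              # restore original order
            parts.extend(group[i] for i in keep)
            indexes.append(sum(1 << i for i in keep))
        return parts, indexes

    data = [3, -9, 1, 6,  4, 0, -8, 1]
    print(structured_sparsify(data))
    # -> ([-9, 6, 4, -8], [10, 5]), i.e. bitmasks 0b1010 and 0b0101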
12. The data processing device of claim 11, wherein the arithmetic circuitry further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform, according to the sparse instruction, the structured sparsification of selecting, from among the m data elements, the n data elements with the largest absolute values as valid data elements.
13. The data processing apparatus of claim 12, wherein the multi-stage pipelined arithmetic circuit comprises four pipeline stages, wherein:
the first pipeline stage comprises m absolute-value units configured to take the absolute value of each of the m data elements to be sparsified, generating m absolute values;
the second pipeline stage comprises a permutation circuit configured to permute the m absolute values into m groups of data, each group containing all m absolute values arranged in an order different from that of every other group;
the third pipeline stage comprises m lanes of comparison circuits configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit configured to select, according to the comparison results, the n data elements with the largest absolute values as valid data elements, and to output the valid data elements together with a corresponding index indicating the positions of the valid data elements among the m data elements.
14. The data processing apparatus of claim 13, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith lane of comparison circuits are configured to sequentially compare one absolute value in the ith group of data with the other m-1 absolute values and generate comparison results, where 1 ≤ i ≤ m.
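The four stages of claims 13-14 can be modelled functionally in Python as below (m = 4, n = 2 assumed); this is a behavioural sketch of the dataflow, not the circuit, and the rotation-based permutation and win-counting comparison scheme are one plausible reading of the claims.

    def pipeline_2_of_4(elems):
        m, n = 4, 2
        # Stage 1: m absolute-value units.
        absv = [abs(e) for e in elems]
        # Stage 2: permutation circuit -> m groups, each a rotation of absv,
        # so group i leads with absv[i] and the orders all differ.
        groups = [absv[i:] + absv[:i] for i in range(m)]
        # Stage 3: m comparison lanes, m-1 comparators each: lane i compares
        # its leading value against the other m-1 values and counts wins.
        wins = [sum(g[0] >= other for other in g[1:]) for g in groups]
        # Stage 4: screening circuit keeps the n positions with the most wins
        # (ties resolved by position order, cf. claim 15).
        keep = sorted(sorted(range(m), key=lambda i: wins[i],
                             reverse=True)[:n])
        return [elems[i] for i in keep], sum(1 << i for i in keep)

    print(pipeline_2_of_4([3, -9, 1, 6]))  # -> ([-9, 6], 10), index 0b1010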
15. A data processing apparatus according to any of claims 13 to 14, wherein the screening circuit is further configured to select in a specified priority order when there are data elements of the same absolute value.
16. The data processing device of claim 10, wherein the anti-sparsification process comprises:
placing, according to the positions indicated by the index portion and according to the pre-sparsification data format, each data element of the data portion at its corresponding position in that data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
17. A data processing apparatus as claimed in claim 5, 8 or 9, wherein
Each bit of the index portion in the structure corresponds to the position of one N-bit data element, N being determined at least in part by the hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
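Claim 17 leaves N and the two alignment requirements to the hardware configuration; the Python below merely shows how a structure's layout could be computed under assumed values (N = 8 data bits per index bit, 64-byte data alignment, 32-byte index alignment), all of which are illustrative.

    def structure_layout(num_positions, elem_bits=8, data_align=64, index_align=32):
        """Byte sizes of the aligned data and index portions of a structure,
        for num_positions element positions covered by the index."""
        def pad(nbytes, align):                  # round up to the alignment
            return (nbytes + align - 1) // align * align
        data_bytes = pad(num_positions * elem_bits // 8, data_align)
        index_bytes = pad((num_positions + 7) // 8, index_align)  # 1 bit each
        return data_bytes, index_bytes

    print(structure_layout(100))  # -> (128, 32)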
18. The data processing apparatus according to any of claims 1-17, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
19. The data processing apparatus of claim 18, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
20. A data processing apparatus as claimed in any one of claims 1 to 19, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instructions include a plurality of instructions, each instruction corresponding to one or more different operating modes.
21. A chip comprising a data processing device according to any one of claims 1 to 20.
22. A board card comprising the chip of claim 21.
23. A method of data processing, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of: shape information of tensor data and spatial information of tensor data;
parsing the descriptor;
reading a corresponding operand based at least in part on the parsed descriptor;
performing the structured sparsity-related operation on the operands; and
outputting the operation result.
24. The data processing method of claim 23, wherein parsing the descriptor comprises:
according to the shape information, determining the data address of tensor data corresponding to the operand in a data storage space; and/or
determining a dependency relationship between instructions according to the spatial information.
25. The data processing method of any of claims 23-24, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
26. The data processing method of any of claims 23-24, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter comprising at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
27. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be sparsified, the method further comprising:
performing structured sparsification on the data to be sparsified according to the sparse instruction; and
outputting a sparsified structure, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
28. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be sparsified, the method further comprising:
performing structured sparsification on the data to be sparsified according to the sparse instruction; and
outputting a sparsified data portion, wherein the data portion comprises the sparsified data of the data to be sparsified.
29. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a third mode of operation and operands of the sparse instruction comprise data to be sparsified, the method further comprising:
performing structured sparsification on the data to be sparsified according to the sparse instruction; and
outputting a sparsified index portion, wherein the index portion indicates the position of the sparsified data in the data to be sparsified.
30. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction comprise data to be sparsified and a sparse index indicating the positions of valid data elements in the structured sparsification to be performed, the method further comprising:
performing structured sparsification on the data to be sparsified according to the sparse instruction and the positions indicated by the sparse index; and
outputting a sparsified structure or a sparsified data portion, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
31. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the pre-sparsification data, the method further comprising:
binding the data portion and the index portion into a structure according to the sparse instruction; and
outputting the structure.
32. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating the position of the data portion in the pre-sparsification data, the method further comprising:
performing, according to the sparse instruction and the positions indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data having the pre-sparsification data format; and
outputting the recovered data.
33. The data processing method according to any of claims 27-30, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, where m > n.
34. The data processing method of claim 33, wherein the structured sparsification is implemented using an operational circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform, according to the sparse instruction, the structured sparsification of selecting, from among the m data elements, the n data elements with the largest absolute values as valid data elements.
35. The data processing method of claim 34, wherein the multi-stage pipelined arithmetic circuit comprises four pipeline stages, wherein:
the first pipeline stage comprises m absolute-value units configured to take the absolute value of each of the m data elements to be sparsified, generating m absolute values;
the second pipeline stage comprises a permutation circuit configured to permute the m absolute values into m groups of data, each group containing all m absolute values arranged in an order different from that of every other group;
the third pipeline stage comprises m lanes of comparison circuits configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit configured to select, according to the comparison results, the n data elements with the largest absolute values as valid data elements, and to output the valid data elements together with a corresponding index indicating the positions of the valid data elements among the m data elements.
36. The data processing method of claim 35, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith lane of comparison circuits are configured to sequentially compare one absolute value in the ith group of data with the other m-1 absolute values and generate comparison results, where 1 ≤ i ≤ m.
37. A data processing method according to any of claims 35 to 36, wherein the screening circuit is further configured to select in a specified priority order when there are data elements of the same absolute value.
38. The data processing method of claim 32, wherein the anti-sparsification process comprises:
placing, according to the positions indicated by the index portion and according to the pre-sparsification data format, each data element of the data portion at its corresponding position in that data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
39. The data processing method of claim 27, 30 or 31, wherein:
each bit of the index portion in the structure corresponds to the position of one N-bit data element, N being determined at least in part by the hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
40. The data processing method of any of claims 23 to 39, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
41. The data processing method of claim 40, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
42. A data processing method as claimed in any one of claims 23 to 41, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
CN202011563257.XA 2020-12-25 2020-12-25 Data processing device, data processing method and related product Pending CN114692841A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011563257.XA CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product
PCT/CN2021/128189 WO2022134873A1 (en) 2020-12-25 2021-11-02 Data processing device, data processing method, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011563257.XA CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product

Publications (1)

Publication Number Publication Date
CN114692841A true CN114692841A (en) 2022-07-01

Family

ID=82130877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011563257.XA Pending CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product

Country Status (1)

Country Link
CN (1) CN114692841A (en)

Similar Documents

Publication Publication Date Title
WO2023045445A1 (en) Data processing device, data processing method, and related product
WO2022134873A1 (en) Data processing device, data processing method, and related product
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN114692844A (en) Data processing device, data processing method and related product
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN114692838A (en) Data processing device, data processing method and related product
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN114692841A (en) Data processing device, data processing method and related product
CN114281561A (en) Processing unit, synchronization method for a processing unit and corresponding product
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
CN113867799A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114692845A (en) Data processing device, data processing method and related product
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
WO2022134688A1 (en) Data processing circuit, data processing method, and related products
WO2022135599A1 (en) Device, board and method for merging branch structures, and readable storage medium
WO2022257980A1 (en) Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product
CN114692840A (en) Data processing device, data processing method and related product
CN113792867B (en) Arithmetic circuit, chip and board card
CN115221104A (en) Data processing device, data processing method and related product
CN114692846A (en) Data processing device, data processing method and related product
CN114692839A (en) Data processing device, data processing method and related product
WO2022135600A1 (en) Computational neural network apparatus, card, method, and readable storage medium
CN113867788A (en) Computing device, chip, board card, electronic equipment and computing method
CN115437693A (en) Computing device operating according to multi-operation instruction and single-operation instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination