CN114692841A - Data processing device, data processing method and related product - Google Patents

Data processing device, data processing method and related product

Info

Publication number
CN114692841A
Authority
CN
China
Prior art keywords
data
sparse
instruction
thinned
tensor
Legal status
Pending
Application number
CN202011563257.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202011563257.XA
Priority to PCT/CN2021/128189
Publication of CN114692841A

Classifications

    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F12/0646 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication; configuration or reconfiguration
    • G06F9/30021 Arrangements for executing specific machine instructions to perform operations on data operands; compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06N20/00 Machine learning
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The present disclosure discloses a data processing apparatus, a data processing method, and a related product. The data processing apparatus may be implemented as a computing apparatus included in a combined processing apparatus, which may also include interface apparatus and other processing apparatus. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. Aspects of the present disclosure provide specialized instructions for operations related to structured sparsity of tensor data that can simplify processing and improve processing efficiency of the machine.

Description

Data processing device, data processing method and related product
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board.
Background
In recent years, with the rapid development of deep learning, algorithms in a series of fields such as computer vision and natural language processing have achieved breakthrough progress. However, deep learning algorithms are both computation-intensive and storage-intensive. As information processing tasks become increasingly complex and the requirements on the real-time performance and accuracy of algorithms keep rising, neural networks are often designed deeper and deeper, so their demands on computation and storage space grow accordingly, and existing deep-learning-based artificial intelligence technology is difficult to apply directly to mobile phones, satellites, or embedded devices with limited hardware resources.
Therefore, compression, acceleration, and optimization of deep neural network models become highly important. A large body of research attempts to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for engineering applications of deep learning technology on embedded and mobile devices. Sparsification is one such model lightweighting method.
Network parameter sparsification reduces redundant components in a larger network by appropriate methods, so as to reduce the network's demands on computation and storage space. However, existing hardware and/or instruction sets do not support sparsification efficiently.
Disclosure of Invention
In order to at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing apparatus, a data processing method, a chip, and a board.
In a first aspect, the present disclosure discloses a data processing apparatus comprising: control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of: shape information of tensor data and spatial information of tensor data; a tensor interface circuit configured to parse the descriptors; a storage circuit configured to store pre-sparsification and/or post-sparsification information; and an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any of the embodiments of the first aspect.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a data processing method, the method comprising: parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data; parsing the descriptor; reading a corresponding operand based at least in part on the parsed descriptor; performing the structured sparsely related operation on the operand; and outputting the operation result.
With the data processing apparatus, the data processing method, the integrated circuit chip, and the board provided as above, the disclosed embodiments provide a sparse instruction for performing an operation related to structured sparseness of tensor data, wherein the tensor data is described by descriptors. In some embodiments, an operating mode bit may be included in the sparse instruction to indicate a different operating mode of the sparse instruction to perform different operations. In other embodiments, multiple sparse instructions may be included, each corresponding to one or more different modes of operation, to perform various operations related to structured sparsity. By providing specialized sparse instructions to perform operations related to structured sparseness of tensor data, processing may be simplified, thereby increasing the processing efficiency of the machine.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a multi-core computing device according to an embodiment of the disclosure;
FIG. 5 is an internal block diagram illustrating a processor core of an embodiment of the disclosure;
FIG. 6 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of data chunking in a data storage space, according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram showing the structure of a data processing apparatus of an embodiment of the present disclosure;
FIG. 9A is an exemplary pipeline operational circuit illustrating structured sparseness processing according to embodiments of the present disclosure;
FIG. 9B is an exemplary pipeline arithmetic circuit illustrating structured sparseness processing according to another embodiment of the present disclosure; and
FIG. 10 is an exemplary flowchart illustrating a data processing method according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects, and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in fields such as computer vision, speech, natural language processing, and data mining under complex scenarios. Deep learning technology in particular is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large size of the input data, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (micro controller unit, MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM (DDR memory), is typically 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing apparatus 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the single-core computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernels of the deep learning network, i.e., the weights; DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 adopts a hierarchical design: it is a system on a chip that includes at least one cluster, and each cluster in turn includes a plurality of processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip, cluster, processor-core hierarchy.
In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple external storage controllers 401 (two are exemplarily shown in the figure), which are used to access an external storage device, such as the DRAM 204 in fig. 2, so as to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to execute tasks. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and the plurality of clusters 405, and is used to transmit data and control signals between the respective modules. The synchronization module 404 is a global synchronization barrier controller (GBC) used to coordinate the operation progress of the clusters and ensure information synchronization. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are exemplarily shown in the figure, and as hardware advances, the multi-core computing device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 4, each cluster 405 includes a plurality of processor cores (IPU cores) 406 and a memory core (MEM core) 407.
Four processor cores 406 are exemplarily shown in the figure; the present disclosure does not limit the number of processor cores 406. Their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise includes three main modules: a control module 51, an arithmetic module 52, and a storage module 53. The functions and structures of the control module 51, the arithmetic module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control access between the NRAM 531/WRAM 532 and the memory unit (SRAM) 408.
Returning to FIG. 4, the storage core 407 is mainly used for storage and communication, i.e., storing data shared among the processor cores 406 and intermediate results, and performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the storage core 407 has scalar operation capability and is used to perform scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 within the same cluster 405 does not need to be fetched from the DRAM 204 by each processor core 406 separately, but is instead relayed among the processor cores 406 through the SRAM 408. The memory core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip and off-chip input/output accesses.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used respectively to perform communication among the processor cores 406, communication between clusters 405, and data transmission between a cluster 405 and the DRAM 204. These will be described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
CDMA 410 is used to control access to SRAM 408 between different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external storage controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be achieved via two channels. The first channel is to access the DRAM 204 directly from the NRAM 531 or WRAM 532 through the IODMA 533; the second channel is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second channel seemingly requires more components and a longer data path, in some embodiments the bandwidth of the second channel is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient over the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.
In other embodiments, the functions of GDMA 411 and IODMA 533 may be integrated in the same component. For convenience of description, the GDMA 411 and the IODMA 533 are considered as different components, and it is within the scope of the disclosure for those skilled in the art to achieve the same functions and achieve the same technical effects as the present disclosure. Further, the functions of GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 may be implemented by the same component.
The instructions of conventional processors are designed to perform basic single-data scalar operations. Here, a single-data scalar operation refers to an instruction in which each operand is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional vectors (i.e., tensor data), and using only scalar operations prevents hardware from completing the operation tasks efficiently. Therefore, how to efficiently perform processing on multidimensional tensor data is also an urgent problem to be solved in the current computing field.
In an embodiment of the present disclosure, a structured sparse instruction is provided for performing an operation related to structured sparseness of tensor data. At least one descriptor is included in at least one operand of the structured sparse instruction, by which information related to tensor data can be obtained. In particular, the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data. Shape information of the tensor data can be used to determine the data address of the tensor data corresponding to the operand in the data storage space. The spatial information of the tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the order of execution of the instructions.
In one possible implementation, the spatial information of the tensor data may be indicated by a space identification (ID). The space ID may also be referred to as a space alias; it refers to the spatial region used to store the corresponding tensor data, and the spatial region may be a contiguous space or multiple segments of space. Different space IDs indicate that the spatial regions they point to have no dependency on each other.
Various possible implementations of shape information for tensor data are described in detail below in conjunction with the figures.
Tensors may contain many forms of data composition. Tensors may be of different dimensions: for example, a scalar may be regarded as a 0-dimensional tensor, a vector may be regarded as a 1-dimensional tensor, and a matrix may be a tensor of 2 or more dimensions. The shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for the three-dimensional tensor:
x3 = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
the shape or dimension of the tensor can be expressed as X3 = (2, 2, 3); that is, three parameters express that the tensor is a three-dimensional tensor whose size in the first dimension is 2, size in the second dimension is 2, and size in the third dimension is 3. When tensor data is stored in a memory, the shape of the tensor data cannot be determined from its data address (or storage region), and further, related information such as the relationships among multiple pieces of tensor data cannot be determined, which results in low efficiency when the processor accesses tensor data.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, where N is a positive integer, e.g., N = 1, 2, or 3, or N may be zero. The three-dimensional tensor in the above example can be represented by the descriptor (2, 2, 3). It should be noted that the present disclosure does not limit the way the descriptor indicates the tensor shape.
In one possible implementation, the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
Although tensor data can be multidimensional, there is a correspondence between a tensor and its storage in memory because the layout of memory is always one-dimensional. Tensor data is typically allocated in a contiguous memory space, i.e., the tensor data can be one-dimensionally expanded (e.g., row-first) and stored in memory.
This relationship between the tensor and the underlying storage may be represented by the offset of a dimension (offset), the size of a dimension (size), the step size of a dimension (stride), and so on. The offset of a dimension refers to the offset in that dimension from a reference position. The size of a dimension refers to the number of elements in that dimension. The step size of a dimension refers to the interval between adjacent elements in that dimension; for example, the step sizes of the above three-dimensional tensor are (6, 3, 1), that is, the step size of the first dimension is 6, the step size of the second dimension is 3, and the step size of the third dimension is 1.
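For illustration, a minimal Python sketch (assuming a row-first layout; the helper names are chosen only for this example) shows how the step sizes (6, 3, 1) follow from the dimension sizes (2, 2, 3) and how a multidimensional index maps to a one-dimensional offset:

```python
# Minimal sketch: row-first flattening of the three-dimensional tensor x3 above.
def row_major_strides(shape):
    """Step sizes (in elements) for a row-first layout, e.g. (2, 2, 3) -> (6, 3, 1)."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

def flat_offset(index, strides):
    """Offset of a multidimensional index within the one-dimensional storage."""
    return sum(i * s for i, s in zip(index, strides))

strides = row_major_strides((2, 2, 3))
print(strides)                           # (6, 3, 1), matching the step sizes in the text
print(flat_offset((1, 0, 2), strides))   # element x3[1][0][2] lies at offset 8
```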
FIG. 6 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in fig. 6, the data storage space 61 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), the size in the Y-axis direction (the total number of rows) is ori_y (not shown), and the start address PA_start (base address) of the data storage space 61 is the physical address of its first data block 62. The data block 63 is part of the data in the data storage space 61; its offset 65 in the X-axis direction is denoted offset_x, its offset 64 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation manner, when the data block 63 is defined by using a descriptor, a data reference point of the descriptor may use the first data block of the data storage space 61, and a reference address of the descriptor may be agreed as a starting address PA _ start of the data storage space 61. The content of the descriptor of the data block 63 may then be determined in combination with the size ori _ X of the data storage space 61 in the X axis, the size ori _ Y in the Y axis, and the offset amount offset _ Y of the data block 63 in the Y axis direction, the offset amount offset _ X in the X axis direction, the size _ X in the X axis direction, and the size _ Y in the Y axis direction.
In one possible implementation, the content of the descriptor can be represented using the following formula (1):
[Formula (1): an image in the original publication expressing the descriptor content in terms of the parameters ori_x, ori_y, offset_x, offset_y, size_x and size_y.]
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
In one possible implementation manner, a reference address of the data reference point of the descriptor in the data storage space may be appointed, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions of at least two vertexes located at diagonal positions in the N dimensional directions relative to the data reference point.
For example, a reference address PA _ base of a data reference point of the descriptor in the data storage space may be agreed. For example, one data (for example, data with a position of (2, 2)) may be selected as a data reference point in the data storage space 61, and a physical address of the data in the data storage space may be used as a reference address PA _ base. The content of the descriptor of the data block 63 in fig. 6 can be determined from the positions of the two vertices of the diagonal position with respect to the data reference point. First, the positions of at least two vertices of the diagonal positions of the data block 63 with respect to the data reference point are determined, for example, using the positions of the diagonal position vertices with respect to the data reference point in the top-left to bottom-right direction, where the relative position of the top-left vertex is (x _ min, y _ min) and the relative position of the bottom-right vertex is (x _ max, y _ max), and then the content of the descriptor of the data block 63 can be determined according to the reference address PA _ base, the relative position of the top-left vertex (x _ min, y _ min), and the relative position of the bottom-right vertex (x _ max, y _ max).
In one possible implementation, the content of the descriptor (with reference to PA _ base) can be represented using the following equation (2):
[Formula (2): an image in the original publication expressing the descriptor content in terms of the reference address PA_base and the diagonal vertex positions (x_min, y_min) and (x_max, y_max).]
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
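As a sketch of how such a vertex-based description could be interpreted (the function name, the returned fields, and the inclusive size computation below are assumptions for illustration, not the patent's formula (2)):

```python
# Sketch: deriving a region's offset and size from two diagonal vertices given
# relative to the data reference point whose reference address is PA_base.
def region_from_vertices(pa_base, top_left, bottom_right):
    x_min, y_min = top_left          # relative position of the top-left vertex
    x_max, y_max = bottom_right      # relative position of the bottom-right vertex
    size_x = x_max - x_min + 1       # assumed: number of columns covered, inclusive
    size_y = y_max - y_min + 1       # assumed: number of rows covered, inclusive
    return {"PA_base": pa_base, "offset": (x_min, y_min), "size": (size_x, size_y)}

print(region_from_vertices(pa_base=0x1000, top_left=(-1, -1), bottom_right=(2, 1)))
# {'PA_base': 4096, 'offset': (-1, -1), 'size': (4, 3)}
```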
In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (3):
[Formula (3): an image in the original publication expressing the descriptor content using the mapping function f(x, y, z) between data description positions and data addresses.]
in one possible implementation, the descriptor is further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (4):
[Formula (4): an image in the original publication expressing the descriptor content with an address parameter PA in addition to the shape parameters.]
where PA is the address parameter. The address parameter may be a logical address or a physical address. When the descriptor is parsed, PA may be taken as any one of a vertex, a midpoint, or a preset point of the tensor shape, and the corresponding data address can be obtained by combining it with the shape parameters in the X direction and the Y direction.
In one possible implementation, the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address comprises a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be the following equation (5):
[Formula (5): an image in the original publication expressing the descriptor content with the reference address parameter PA_start in addition to the shape parameters.]
wherein PA _ start is a reference address parameter, which is not described again.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In a possible implementation, a default base address may be set for a task; this base address is used by the descriptors in the instructions of the task, and shape parameters based on this base address may be included in the descriptor content. The base address may be determined by setting an environment parameter for the task. The relevant description and usage of the base address can be found in the above embodiments. In this implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the mode of setting a common reference address by using the environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure is not limited to a particular calculation method of the data address.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x × size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (6):
PA1(x,y)=PA_start+(offset_y-1)*ori_x+offset_x (6)
Based on the data start address PA1(x,y) determined according to equation (6) above, in combination with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage region, the storage region of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.
For example, if the content of the descriptor in the operand is expressed by formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x × size_y, and the data description position for the descriptor included in the operand is (xq, yq), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (7):
PA2(x,y)=PA_start+(offset_y+yq-1)*ori_x+(offset_x+xq) (7)
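To make equations (6) and (7) concrete, the following sketch computes both addresses from the parameters used above (assuming each element occupies one addressing unit; the example values are invented for illustration):

```python
# Sketch of the address computations in equations (6) and (7); names follow the text.
def pa1_start(pa_start, ori_x, offset_x, offset_y):
    """Equation (6): start data address of the tensor region described by the descriptor."""
    return pa_start + (offset_y - 1) * ori_x + offset_x

def pa2_element(pa_start, ori_x, offset_x, offset_y, xq, yq):
    """Equation (7): data address of the element at data description position (xq, yq)."""
    return pa_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)

# Assumed example: a storage space 16 elements wide, region offset (3, 2).
print(pa1_start(pa_start=0, ori_x=16, offset_x=3, offset_y=2))                  # 19
print(pa2_element(pa_start=0, ori_x=16, offset_x=3, offset_y=2, xq=1, yq=1))    # 36
```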
in one possible implementation, the descriptor may indicate the data of the block. The data partitioning can effectively accelerate the operation speed and improve the processing efficiency in many applications. For example, in graphics processing, convolution operations often use data partitioning for fast arithmetic processing.
FIG. 7 shows a schematic diagram of data chunking in a data storage space according to an embodiment of the present disclosure. As shown in fig. 7, the data storage space 700 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown). Unlike the tensor data of fig. 6, the tensor data stored in fig. 7 includes a plurality of data blocks.
In this case, the descriptor requires more parameters to represent these data blocks. Taking the X axis (X dimension) as an example, the following parameters may be involved: ori_x; x.tile.size (the size 702 of a tile); x.tile.stride (the tile step size 704, i.e., the distance between the first point of the first tile and the first point of the second tile); x.tile.num (the number of tiles, 3 tiles being shown in the figure); and x.stride (the overall step size, i.e., the distance from the first point of the first row to the first point of the second row). Other dimensions may similarly include corresponding parameters.
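One possible reading of these per-axis block parameters, written as a sketch (the address arithmetic below is an assumption for illustration; the text does not give an explicit formula for the blocked case):

```python
# Sketch: offset along the X axis of element index x in a row that is split into
# x.tile.num tiles of x.tile.size elements, whose first points are x.tile.stride apart.
def x_offset_in_row(x, tile_size, tile_stride):
    tile_index = x // tile_size      # which tile the element falls into
    index_in_tile = x % tile_size    # position of the element inside that tile
    return tile_index * tile_stride + index_in_tile

# Assumed example: tiles of 4 elements whose first points are 6 elements apart.
print([x_offset_in_row(x, tile_size=4, tile_stride=6) for x in range(8)])
# [0, 1, 2, 3, 6, 7, 8, 9]
```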
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish the descriptor; for example, the identifier of the descriptor may be its number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension of the tensor data.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier and content of the descriptors and the tensor data indicated by the descriptors may be stored in the same block of internal memory; for example, a contiguous block of on-chip cache at addresses ADDR0-ADDR1023 may be used to store the relevant content of the descriptors. Within this block, addresses ADDR0-ADDR63 can be used as a descriptor storage space to store the identifiers and content of the descriptors, and addresses ADDR64-ADDR1023 can be used as a data storage space to store the tensor data indicated by the descriptors. In the descriptor storage space, the identifiers of the descriptors may be stored at addresses ADDR0-ADDR31 and the content of the descriptors at addresses ADDR32-ADDR63. It should be understood that ADDR does not necessarily denote a single bit or byte; it is used here to denote one address, i.e., one addressing unit. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited thereto.
In one possible implementation, the identity of the descriptor, the content, and the tensor data indicated by the descriptor may be stored in different areas of the internal memory. For example, a register may be used as a descriptor storage space, the identifier and the content of the descriptor may be stored in the register, an on-chip cache may be used as a data storage space, and tensor data indicated by the descriptor may be stored.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area may be allocated in the buffer space for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identity and content of the descriptors may be stored in an internal memory and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identification and content of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored under-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, separate data storage spaces may be divided for tensor data, each of which has a one-to-one correspondence with descriptors at the start address of the data storage space. In this case, a circuit or module (e.g., an entity external to the disclosed computing device) responsible for parsing the computation instruction may determine the data address in the data storage space of the data corresponding to the operand from the descriptor.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may be further used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor may further include at least one address parameter indicating the address of the tensor data. For example, if the tensor data is 3-dimensional, then when the descriptor points to the address of the tensor data, the content of the descriptor may include one address parameter indicating the address of the tensor data, such as its starting physical address, or may include multiple address parameters of the address of the tensor data, such as the starting address of the tensor data plus an address offset, or address parameters of the tensor data in each dimension. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein, the reference address can be different according to the change of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the base address may comprise a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one direction of the N dimensional directions, the size of the storage area in at least one direction of the N dimensional directions, the offset of the storage area in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that the shape parameters representing the tensor data may be selected by one of ordinary skill in the art based on the actual situation, and are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
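Purely for illustration, the pieces of descriptor information discussed above can be collected into the following sketch of a data structure (the grouping and the field names are assumptions made for readability, not the encoding used by the data processing apparatus):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TensorDescriptor:
    """Illustrative container for the descriptor information discussed in the text."""
    descriptor_id: int                    # identifier of the descriptor
    space_id: Optional[int] = None        # spatial information (space ID / alias)
    storage_size: Tuple[int, ...] = ()    # size of the data storage space, e.g. (ori_x, ori_y)
    region_offset: Tuple[int, ...] = ()   # offset of the storage region, e.g. (offset_x, offset_y)
    region_size: Tuple[int, ...] = ()     # size of the storage region, e.g. (size_x, size_y)
    base_address: Optional[int] = None    # address parameter, e.g. PA_start or PA_base

desc = TensorDescriptor(descriptor_id=0, space_id=1, storage_size=(16, 8),
                        region_offset=(3, 2), region_size=(4, 4), base_address=0)
```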
One embodiment of the present disclosure provides a data processing scheme based on the foregoing hardware environment, performing operations related to structured sparseness of tensor data according to specialized sparse instructions.
Fig. 8 shows a block diagram of a data processing apparatus 800 according to an embodiment of the present disclosure. The data processing apparatus 800 may be implemented, for example, in the computing apparatus 201 of fig. 2. As shown, the data processing apparatus 800 may include a control circuit 810, a tensor interface circuit 812, a storage circuit 820, and an arithmetic circuit 830.
The control circuit 810 may function similarly to the control module 31 of fig. 3 or the control module 51 of fig. 5, and may include, for example, an instruction fetch unit to fetch an instruction from, for example, the processing device 203 of fig. 2, and an instruction decode unit to decode the fetched instruction and send the decoded result as control information to the arithmetic circuit 830 and the storage circuit 820.
In one embodiment, the control circuitry 810 may be configured to parse a sparse instruction, wherein the sparse instruction indicates an operation related to structured sparsity and the at least one operand of the sparse instruction comprises at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data.
Tensor Interface Unit (TIU) 812 may be configured to implement operations associated with the descriptors under control of control circuitry 810. These operations may include, but are not limited to, registration, modification, deregistration, resolution of descriptors; reading and writing descriptor content, etc. The present disclosure does not limit the specific hardware type of tensor interface circuit. In this way, the operation associated with the descriptor can be realized by dedicated hardware, and the access efficiency of tensor data is further improved.
In some embodiments, tensor interface circuit 812 may be configured to parse shape information of tensor data included in an operand of an instruction to determine a data address in the data storage space of data corresponding to the operand.
Alternatively or additionally, in still other embodiments, tensor interface circuit 812 may be configured to compare spatial information (e.g., spatial IDs) of tensor data included in operands of two instructions to determine dependencies of the two instructions to determine out-of-order execution, synchronization, etc. operations of the instructions.
Although control circuit 810 and tensor interface circuit 812 are shown in fig. 8 as two separate blocks, those skilled in the art will appreciate that these two units may also be implemented as one block or more blocks, and the present disclosure is not limited in this respect.
Storage circuitry 820 may be configured to store pre-sparsification and/or post-sparsification information. In one embodiment, the operands of the sparse instructions are weights of a neural network. In this embodiment, the memory circuit may be, for example, the WRAM 332 of fig. 3 or the WRAM 532 of fig. 5.
The arithmetic circuitry 830 may be configured to perform corresponding operations according to the sparse instruction based on the parsed descriptors.
In some embodiments, the arithmetic circuitry 830 may include one or more sets of pipelining circuits 831, where each set of pipelining circuits 831 may include one or more operators. When each set of pipelined arithmetic circuits includes a plurality of operators, the plurality of operators may be configured to perform a multi-stage pipelined arithmetic, i.e., to form a multi-stage arithmetic pipeline.
In some application scenarios, the pipelined arithmetic circuitry of the present disclosure may support operations related to structured sparsity. For example, in performing structured sparsification processing, a multi-stage pipelined arithmetic circuit composed of circuits such as comparators may be employed to perform an operation of extracting n data elements from every m data elements as valid data elements, where m > n. In one implementation, m is 4 and n is 2. In other implementations, n may take other values, such as 1 or 3.
In one embodiment, the arithmetic circuit 830 may further include an arithmetic processing circuit 832, which may be configured to pre-process data before the pipelined arithmetic circuit 831 performs the operation, or to post-process the data after the operation, according to the arithmetic instruction. In some application scenarios, the aforementioned pre-processing and post-processing may, for example, include data splitting and/or data splicing operations. In structured sparse processing, the arithmetic processing circuit may split the data to be sparsified into segments of m data elements and then send the split data to the pipelined arithmetic circuit 831 for processing.
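As a functional sketch of this splitting plus selection (assuming m = 4 and n = 2; the mask format is an assumption made for illustration, and the sketch is a software analogue rather than the circuits themselves):

```python
# Sketch: split the data into segments of m elements and keep, in each segment,
# the n elements with the largest absolute values (structured sparsification).
def structured_sparsify(data, m=4, n=2):
    kept_values, masks = [], []
    for start in range(0, len(data), m):                # pre-processing: split into segments
        segment = data[start:start + m]
        order = sorted(range(len(segment)), key=lambda i: abs(segment[i]), reverse=True)
        keep = sorted(order[:n])                        # indices of the n largest magnitudes
        kept_values.extend(segment[i] for i in keep)
        masks.append(tuple(1 if i in keep else 0 for i in range(len(segment))))
    return kept_values, masks

values, masks = structured_sparsify([0.1, -2.0, 0.5, 3.0, -1.0, 0.2, 0.0, -0.7])
print(values)   # [-2.0, 3.0, -1.0, -0.7]
print(masks)    # [(0, 1, 0, 1), (1, 0, 0, 1)]
```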
FIG. 9A illustrates an exemplary operation pipeline for structured sparsity processing according to one embodiment of the present disclosure. The embodiment of fig. 9A shows a structured sparsification process in which, for m = 4 and n = 2, the 2 data elements with the larger absolute values are selected from among the 4 data elements A, B, C and D.
As shown in fig. 9A, the structured sparsification processing described above can be performed by a 4-stage pipelined arithmetic circuit composed of absolute value operators and comparators.
The first stage pipelined arithmetic circuit may include 4 absolute value operators 910 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively.
The second stage pipelined arithmetic circuit may comprise two comparators for performing a grouped comparison of the 4 absolute values output by the previous stage. For example, the first comparator 921 may compare the absolute values of data elements A and B and output the larger value Max00, and the second comparator 922 may compare the absolute values of data elements C and D and output the larger value Max10.
The third stage pipeline operation circuit may include a third comparator 930 for comparing the 2 larger values Max00 and Max10 output from the previous stage and outputting the larger value Max0. This larger value Max0 is the value with the largest absolute value among the 4 data elements.
The fourth stage pipeline operation circuit may include a fourth comparator 940, which compares the smaller value Min0 from the previous stage's comparison (i.e., the smaller of Max00 and Max10) with the other value in the group in which the maximum value Max0 is located, and outputs the larger value Max1. This larger value Max1 is the second largest of the absolute values of the 4 data elements.
Therefore, the two-out-of-four structured sparse processing can be realized through the 4-stage pipelined operation circuit.
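Purely for illustration, the dataflow of this 4-stage pipeline can be modeled in software as a minimal sketch (Python is used here only as a notation); this is not the hardware implementation of the present disclosure, and all function and variable names are hypothetical.

# Illustrative software model of the fig. 9A dataflow; not the hardware itself.
def select_two_of_four(elements):
    """Select the 2 of 4 elements with the largest absolute values (m=4, n=2)."""
    # Stage 1: absolute values of A, B, C, D
    mags = [abs(x) for x in elements]
    # Stage 2: compare within the groups (A, B) and (C, D)
    idx00 = 0 if mags[0] >= mags[1] else 1   # winner of group 0 -> Max00
    idx10 = 2 if mags[2] >= mags[3] else 3   # winner of group 1 -> Max10
    # Stage 3: compare the two group winners; the larger one is the overall maximum Max0
    if mags[idx00] >= mags[idx10]:
        max0_idx, min0_idx = idx00, idx10    # Min0 is the losing group winner
        other_idx = 1 - idx00                # the loser inside Max0's group
    else:
        max0_idx, min0_idx = idx10, idx00
        other_idx = 5 - idx10                # maps index 2 -> 3 and 3 -> 2
    # Stage 4: the second largest value Max1 is the larger of Min0 and the
    # remaining element of Max0's group
    max1_idx = other_idx if mags[other_idx] >= mags[min0_idx] else min0_idx
    keep = sorted([max0_idx, max1_idx])
    return [elements[i] for i in keep], keep

# Example: keeps -8 (position 1) and 5 (position 2)
print(select_two_of_four([3, -8, 5, 1]))     # ([-8, 5], [1, 2])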
FIG. 9B illustrates an exemplary operation pipeline for structured sparseness processing according to another embodiment of the present disclosure. Likewise, in the embodiment of fig. 9B, a structured thinning process is shown in which 2 data elements having larger absolute values are screened out from 4 data elements A, B, C and D when m is 4 and n is 2.
As shown in fig. 9B, the structured thinning-out processing described above can be performed by a multistage pipelined arithmetic circuit composed of an absolute value calculator, a comparator, and the like.
The first pipeline stage may include m (here 4) absolute value operators 950 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively. To facilitate the final output of valid data elements, in some embodiments, the first pipeline stage may output both the original data elements (i.e., A, B, C and D) and the data after the absolute value operation (i.e., |A|, |B|, |C| and |D|).
The second pipeline stage may include a permutation and combination circuit 960 for permutation and combination of the m absolute values to generate m sets of data, where each set of data includes the m absolute values, and the m absolute values are located at different positions in each set of data.
In some embodiments, the permutation combining circuit may be a circular shifter that circularly shifts the arrangement of the m absolute values (e.g., |A|, |B|, |C| and |D|) m-1 times to generate m sets of data. For example, in the example shown in the figure, 4 sets of data are generated, respectively: { | A |, | B |, | C |, | D | }, { | B |, | C |, | D |, | A | }, { | C |, | D |, | A |, | B | } and { | D |, | A |, | B |, | C | }. While each group of data is output, the corresponding original data element is also output, with each group of data corresponding to one original data element.
The third pipeline stage includes a comparison circuit 970 for comparing absolute values in the m sets of data and generating a comparison result.
In some embodiments, the third pipeline stage may include m comparison circuits, each comparison circuit including m-1 comparators (971, 972, 973), where the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith set of data with the other m-1 absolute values and generate comparison results, where 1 ≦ i ≦ m.
As can be seen, the third pipeline stage may also be considered to comprise m-1 (here 3) sub-pipeline stages. Each sub-pipeline stage comprises m comparators, each comparing the leading absolute value of its corresponding group with one of the other absolute values. Over the m-1 sub-pipeline stages, each leading absolute value is compared with the other m-1 absolute values in turn.
For example, in the example shown in the figure, the 4 comparators 971 in the first sub-pipeline stage are used to compare the first absolute value with the second absolute value in the 4 sets of data, respectively, and output comparison results w0, x0, y0, and z0, respectively. The 4 comparators 972 in the second sub-pipeline stage are configured to compare the first absolute value with the third absolute value in the 4 groups of data, and output comparison results w1, x1, y1, and z1, respectively. The 4 comparators 973 in the third sub-pipeline stage are configured to compare the first absolute value with the fourth absolute value in the 4 sets of data, and output comparison results w2, x2, y2, and z2, respectively.
Thus, a comparison of each absolute value with the other m-1 absolute values can be obtained.
In some embodiments, the comparison results may be represented using a bitmap. For example, at the 1st comparator of the 1st way comparison circuit, when |A| ≧ |B|, w0 = 1; at the 2nd comparator of the 1st way, when |A| < |C|, w1 = 0; at the 3rd comparator of the 1st way, when |A| ≧ |D|, w2 = 1. The output result of the 1st way comparison circuit is thus {A, w0, w1, w2}, i.e., {A, 1, 0, 1}. Similarly, the output result of the 2nd way comparison circuit is {B, x0, x1, x2}, the output result of the 3rd way comparison circuit is {C, y0, y1, y2}, and the output result of the 4th way comparison circuit is {D, z0, z1, z2}.
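For illustration only, the bitmap generation of the third pipeline stage can be modeled in software as follows, assuming m is 4; the names are hypothetical and the sketch is not the circuit itself.

# Illustrative model of the third-stage comparison bitmaps (m = 4).
def comparison_bitmaps(mags):
    """For each of the m cyclically shifted groups, compare the leading
    absolute value with the other m - 1 values; each result is one bit."""
    m = len(mags)
    bitmaps = []
    for i in range(m):                        # the i-th way comparison circuit
        group = mags[i:] + mags[:i]           # cyclic shift so value i leads
        bits = [1 if group[0] >= group[j] else 0 for j in range(1, m)]
        bitmaps.append(bits)
    return bitmaps

# For |A| = 3, |B| = 8, |C| = 5, |D| = 1, the first way yields {w0, w1, w2} = [0, 0, 1]
print(comparison_bitmaps([3, 8, 5, 1]))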
The fourth pipeline stage includes a filtering circuit 980 for selecting, according to the comparison results of the third stage, the n data elements with the larger absolute values from the m data elements as valid data elements, and outputting the valid data elements and corresponding indexes. The index is used to indicate the positions of these valid data elements among the m input data elements. For example, when A and C are selected from the four data elements A, B, C and D, their corresponding indices may be 0 and 2.
Based on the comparison results, appropriate logic can be designed to select the n data elements with the larger absolute values. Since multiple data elements with the same absolute value may occur, in a further embodiment, when there are data elements with the same absolute value, the selection is made in a specified priority order. For example, the priorities may be fixed in order of increasing index, so that A has the highest priority and D the lowest. In one example, when the absolute values of A, C and D are all the same and greater than the absolute value of B, the data selected are A and C.
From the foregoing comparison results, it can be determined from w0, w1 and w2 how many of {|B|, |C|, |D|} the value |A| is greater than (or equal to). If w0, w1 and w2 are all 1, then A is selected, because |A| is no smaller than |B|, |C| and |D| and is thus the maximum of the four numbers. If exactly two of w0, w1 and w2 are 1, then |A| is the second largest of the four absolute values, and A is therefore also selected. Otherwise, A is not selected. Thus, in some embodiments, the determination can be made based on the number of 1s among these values.
In one implementation, the valid data elements may be selected based on the following logic. First, the number of times each data element is larger than the other data elements may be counted. For example, define NA = sum_w = w0 + w1 + w2, NB = sum_x = x0 + x1 + x2, NC = sum_y = y0 + y1 + y2, and ND = sum_z = z0 + z1 + z2. Subsequently, the judgment and selection are performed under the following conditions.
The condition for selecting A is: NA = 3; or NA = 2 and only one of NB/NC/ND is 3.
The condition for selecting B is: NB = 3; or NB = 2 and only one of NA/NC/ND is 3, and NA ≠ 2.
The condition for selecting C is: NC = 3 and at most one of NA/NB is 3; or NC = 2 and only one of NA/NB/ND is 3, and none of NA/NB is 2.
The condition for selecting D is: ND = 3 and at most one of NA/NB/NC is 3; or ND = 2 and only one of NA/NB/NC is 3, and none of NA/NB/NC is 2.
Those skilled in the art will appreciate that there is some redundancy in the above logic in order to ensure selection at a predetermined priority. Based on the size and order information provided by the comparison, one skilled in the art may devise other logic to implement the screening of valid data elements, and the disclosure is not limited in this respect. Thus, the multi-stage pipeline arithmetic circuit of fig. 9B can also realize two-out-of-four structured thinning-out processing.
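Purely as an illustration of the above conditions (and not as the hardware implementation), one possible software rendering is sketched below, assuming m is 4 and n is 2 and reusing comparison_bitmaps from the previous sketch; other screening logic may equally be designed.

# Illustrative screening logic based on the counts NA, NB, NC, ND.
def screen_valid_elements(bitmaps):
    """Pick 2 of 4 positions per the conditions above (priority A > B > C > D)."""
    nA, nB, nC, nD = (sum(b) for b in bitmaps)
    sel_a = nA == 3 or (nA == 2 and [nB, nC, nD].count(3) == 1)
    sel_b = nB == 3 or (nB == 2 and [nA, nC, nD].count(3) == 1 and nA != 2)
    sel_c = (nC == 3 and [nA, nB].count(3) <= 1) or \
            (nC == 2 and [nA, nB, nD].count(3) == 1 and 2 not in (nA, nB))
    sel_d = (nD == 3 and [nA, nB, nC].count(3) <= 1) or \
            (nD == 2 and [nA, nB, nC].count(3) == 1 and 2 not in (nA, nB, nC))
    return [i for i, s in enumerate((sel_a, sel_b, sel_c, sel_d)) if s]

# With |A| = |C| = |D| = 5 > |B| = 2 the fixed priority keeps A and C
print(screen_valid_elements(comparison_bitmaps([5, 2, 5, 5])))   # [0, 2]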
Those skilled in the art will appreciate that other forms of pipelined arithmetic circuits may also be designed to implement structured sparseness, and the present disclosure is not limited in this respect.
As mentioned previously, the operands of the sparse instructions may be data in a neural network, such as weights, neurons, and the like. Data in neural networks typically contain multiple dimensions. For example, in a convolutional neural network, data may exist in four dimensions: input channel, output channel, length, and width. In some embodiments, the sparse instruction may be used for structured sparse processing of at least one dimension of multidimensional data in a neural network. In particular, in one implementation, the sparse instructions may be used for structured sparse processing of the input channel dimension of multidimensional data in a neural network, for example, in an inference process or a forward training process of the neural network. In another implementation, the sparse instructions may be used to perform structured sparse processing on both the input channel dimension and the output channel dimension of multidimensional data in a neural network, for example, during reverse training of the neural network.
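As a non-limiting sketch of the input channel case, the two-out-of-four selection described above may be applied group by group along the input channel dimension; the simplified two-dimensional weight layout and all names below are hypothetical, and select_two_of_four refers to the earlier sketch.

# Illustrative: 2:4 structured sparsity along the input-channel axis of a
# weight tensor viewed as [Co][Ci], reusing select_two_of_four from above.
def sparsify_input_channels(weights):
    out = []
    for row in weights:                       # one output channel at a time
        kept, idx = [], []
        for g in range(0, len(row), 4):       # Ci assumed to be a multiple of 4
            vals, pos = select_two_of_four(row[g:g + 4])
            kept.extend(vals)
            idx.extend(g + p for p in pos)
        out.append((kept, idx))
    return out

print(sparsify_input_channels([[3, -8, 5, 1, 0, 7, -2, 6]]))
# [([-8, 5, 7, 6], [1, 2, 5, 7])]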
In one embodiment, in response to receiving a plurality of sparse instructions, one or more multi-stage pipelined arithmetic circuits of the present disclosure may be configured to perform multiple data operations, such as executing single instruction multiple data ("SIMD") instructions. In another embodiment, the plurality of operations performed by the operation circuits of each stage are predetermined according to functions supported by the plurality of operation circuits arranged stage by stage in the multistage operation pipeline.
In the context of the present disclosure, the aforementioned plurality of sparse instructions may be microinstructions or control signals that are executed within one or more multi-stage operation pipelines, and they may include (or indicate) one or more operations to be performed by the multi-stage operation pipelines. Depending on the operational scenario, these operations may include, but are not limited to, arithmetic operations such as convolution operations and matrix multiplication operations, logical operations such as AND, XOR and OR operations, shift operations, or any combination of the foregoing types of operations.
FIG. 10 illustrates an exemplary flow diagram of a data processing method 1000 in accordance with an embodiment of the disclosure.
As shown in fig. 10, in step 1010, a sparse instruction is parsed, the sparse instruction indicating an operation related to structured sparsity, and at least one operand of the sparse instruction includes at least one descriptor indicating at least one of the following information: shape information of the tensor data and spatial information of the tensor data. This step may be performed, for example, by control circuitry 810 of fig. 8.
Next, in step 1020, the descriptor is parsed. This step may be performed, for example, by tensor interface circuit 812 of figure 8. Specifically, the data address of the tensor data corresponding to the operand in the data storage space can be determined according to the shape information of the tensor data; and/or determining dependencies between instructions based on spatial information of the tensor data.
Next, in step 1030, the corresponding operand is read based at least in part on the parsed descriptor. When the operand is tensor data, the data address can be obtained according to the resolved descriptor, so that the corresponding data can be read. The sparse instructions may indicate different modes of operation, with correspondingly different operands, as will be described in detail below. This step may be performed, for example, by control circuitry 810 of fig. 8 reading from storage circuit 820.
Next, in step 1040, the operations related to structured sparseness are performed on the operands that are read. This step may be performed, for example, by the arithmetic circuit 830 of fig. 8.
Finally, in step 1050, the operation result is output. For example, the operation result may be output by the arithmetic circuit 830 to the storage circuit 820 for subsequent use.
Operations related to structured sparsity may exist in various forms, such as structured sparsity processing, anti-sparsity processing, and the like. Various instruction schemes may be devised to implement the structured sparsity-related operations.
In one scheme, a sparse instruction may be designed, and an operation mode bit may be included in the instruction to indicate different operation modes of the sparse instruction, so as to perform different operations.
In another scheme, a plurality of sparse instructions may be designed, each instruction corresponding to one or more different operation modes, so as to execute different operations. In one implementation, a corresponding sparse instruction may be designed for each mode of operation. In another implementation, the operation modes can be classified according to their characteristics, and a sparse instruction is designed for each type of operation mode. Further, when multiple operating modes are included in a certain class of operating modes, an operating mode bit may be included in the sparse instruction to indicate the respective operating mode.
Regardless of the scheme, the sparse instruction may indicate its corresponding mode of operation via an operating mode bit and/or the instruction itself.
In one embodiment, the sparse instruction may indicate the first mode of operation. In the first mode of operation, the operands of the sparse instruction include the data to be thinned out. At this time, the arithmetic circuit 830 may be configured to perform structured thinning-out processing on the data to be thinned out according to the sparse instruction, and output the thinned-out structure to the storage circuit 820.
The structured thinning-out processing in the first operation mode may follow a predetermined filtering rule; for example, according to a rule of keeping the larger absolute values, the n data elements with the larger absolute values are filtered out of every m data elements as valid data elements. The arithmetic circuit 830 may be configured, for example, as the pipelined arithmetic circuits described with reference to fig. 9A and 9B, to perform this structured sparseness processing.
The result after the sparsification processing comprises two parts: a data portion and an index portion. The data part comprises data after the data to be thinned are thinned, namely effective data elements extracted according to a screening rule of structured thinning processing. The index portion is used to indicate the position of the thinned data, that is, the effective data element, in the data before thinning (that is, the data to be thinned).
The structure in the embodiments of the present disclosure includes a data portion and an index portion that are bound to each other. In some embodiments, every 1 bit in the index portion may correspond to one data element. For example, when the data type is fix8, one data element is 8 bits, and every 1 bit in the index portion corresponds to 8 bits of data. In other embodiments, every 1 bit in the index portion of the structure may be set to correspond to the position of N bits of data, N being determined at least in part based on the hardware configuration, to facilitate hardware-level implementation when the structure is subsequently used. For example, it may be set that every 1 bit of the index portion in the structure corresponds to the position of 4 bits of data; when the data type is fix8, every 2 bits in the index portion then correspond to one data element of the fix8 type. In some embodiments, the data portion of the structure may be aligned according to a first alignment requirement and the index portion of the structure may be aligned according to a second alignment requirement, such that the entire structure also satisfies an alignment requirement. For example, the data portion may be aligned to 64B, the index portion may be aligned to 32B, and the entire structure is then aligned to 96B (64B + 32B). Through such alignment requirements, the number of memory accesses during subsequent use can be reduced, improving processing efficiency.
By using such a structure, the data part and the index part can be used together. Since the proportion of valid data elements among all data elements in the structured thinning-out process is fixed, for example n/m, the data size after the thinning-out process is also fixed or predictable. Thus, the structure can be stored densely in the storage circuit without performance loss.
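For illustration, the alignment arithmetic above can be checked with a small calculation; the figures below assume fix8 data, m being 4 and n being 2, and the hypothetical correspondence of 2 index bits per fix8 element (1 index bit per 4 bits of data).

# Illustrative size arithmetic for one densely stored structure (assumptions as stated).
def structure_sizes(num_original_elems, elem_bits=8, data_align=64, index_align=32,
                    n=2, m=4, bits_per_index_pos=2):
    def align(nbytes, a):
        return -(-nbytes // a) * a                      # round up to a multiple of a
    data_bytes = num_original_elems * n // m * elem_bits // 8
    index_bytes = num_original_elems * bits_per_index_pos // 8
    return align(data_bytes, data_align), align(index_bytes, index_align)

# 128 original fix8 elements -> 64B of kept data and 32B of index, a 96B structure
print(structure_sizes(128))   # (64, 32)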
In another embodiment, the sparse instruction may indicate the second mode of operation. The second operation mode differs from the first operation mode in that the content of the output is different, and the second operation mode outputs only the structured thinning-out processed data portion without outputting the index portion.
Similarly, in the second mode of operation, the operands of the sparse instruction include data to be thinned out. At this time, the arithmetic circuit 830 may be configured to perform the structured thinning-out processing on the data to be thinned out according to the thinning-out instruction, and output the thinned-out data portion to the storage circuit 820. The data part comprises data after the data to be thinned are thinned. The data part is densely stored in the storage circuit. The output data portion is aligned by n elements. For example, in the example where m is 4 and n is 2, the input data to be thinned is aligned by 4 elements, and the output data portion is aligned by 2 elements.
In yet another embodiment, the sparse instruction may indicate a third mode of operation. The third operation mode is different from the first operation mode in that the content of the output is different, and the third operation mode outputs only the index portion after structured thinning processing without outputting the data portion.
Similarly, in a third mode of operation, the operands of the sparse instruction include data to be thinned out. At this time, the arithmetic circuit 830 may be configured to perform the structured thinning-out processing on the data to be thinned out according to the thinning-out instruction, and output the index portion after the thinning-out processing to the storage circuit 820. The index portion indicates the original position of the thinned data in the data to be thinned. The index part is densely stored in the storage circuit. Every 1 bit in the output index portion corresponds to the position of one data element. Since the index portion may be used alone, for example, for structured sparseness of neurons in subsequent convolution processing, while the data type of the neurons may be uncertain, the separately stored index portion may be adapted to various data types by corresponding every 1 bit in the index portion to the position of one data element.
In yet another embodiment, the sparse instruction may indicate a fourth mode of operation. The fourth operation mode is different from the first operation mode in that the fourth operation mode specifies a filtering rule of the structured thinning-out processing, instead of performing the structured thinning-out processing in accordance with a predetermined filtering rule (for example, the foregoing rule of larger absolute value). At this time, the sparse instruction has two operands: data to be thinned and a sparse index. The operand of the added sparse index is used to indicate the position of the valid data element in the structured sparsity to be performed, i.e. to specify the filtering rules for the structured sparsity processing. Each 1 bit in the sparse index corresponds to the position of one data element, so that the sparse index can be suitable for data to be thinned of various data types.
In a fourth operation mode, the arithmetic circuit 830 may be configured to perform structured sparse processing on the data to be thinned according to the sparse instruction and according to the position indicated by the sparse index, and output a result after the sparse processing to the storage circuit. In one implementation, the output result may be the thinned structure. In another implementation, the output result may be a thinned out portion of the data.
The meaning of the structure body is the same as that in the first operation mode, the structure body comprises a data part and an index part which are mutually bound, the data part comprises data after the data to be thinned is subjected to thinning processing, and the index part is used for indicating the original position of the thinned data in the data to be thinned. Alignment requirements, correspondence, etc. for the data portion, the index portion, and the like in the structure are the same as in the first operation mode, and are not repeated here.
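As an illustrative sketch of the fourth operation mode (with hypothetical names, and not the hardware implementation), the sparse index simply dictates which elements are kept, one index bit per data element.

# Illustrative: structured sparsification driven by a caller-supplied sparse index.
def sparsify_with_index(data, index_bits):
    """Keep exactly the elements whose index bit is 1 (one bit per element)."""
    assert len(data) == len(index_bits)
    return [d for d, b in zip(data, index_bits) if b]

# The index 1010 keeps positions 0 and 2 of the 4-element group
print(sparsify_with_index([3, -8, 5, 1], [1, 0, 1, 0]))   # [3, 5]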
The above four operation modes provide structured thinning processing of data, for example, processing according to a predetermined filtering rule or according to a filtering rule specified by an operand of an instruction, and provide different output contents, for example, an output structure body, an output-only data portion, an output-only index portion, and the like, respectively. The instruction design can well support structured sparse processing, and provides various output options to adapt to different scene requirements, for example, when data and indexes are required to be bound for use, an output structure body can be selected, and when the index part or the data part is required to be used independently, only the index part or the data part can be selected to be output.
In yet another embodiment, the sparse instruction may indicate a fifth mode of operation. The fifth operation mode does not need structured sparse processing, and only needs to bind the separated or independent data part and the index part into a structure.
In a fifth mode of operation, the operands of the sparse instruction include the sparsified data portion and a corresponding index portion. The data portion and the index portion are each in a compact storage format, but are not bound. The input data portion is aligned by n elements. For example, in the example of m-4 and n-2, the input data portion is aligned by 2 elements. The index portion indicates an original position of the data portion in the data before the thinning-out process, wherein each 1 bit of the index portion corresponds to one data element.
At this time, the arithmetic circuit 830 may be configured to bind the data part and the index part into a structure according to the sparse instruction, and output the structure to the storage circuit. The meaning of the structure, the alignment requirements for the data portion and the index portion, the correspondence, and the like are the same as in the first operation mode, and are not repeated here. Depending on the data type of the data elements, the index portion in the structure needs to be generated accordingly based on the data type and on the bit correspondence of the index portion in the structure. For example, when the input index portion is 0011, where each 1 bit corresponds to one data element, and the data type is fix8 (i.e., each data element has 8 bits), then, according to the correspondence in which each 1 bit of the index portion in the structure corresponds to 4 bits of data, the index portion in the structure should be 00001111, i.e., 2 bits correspond to one data element.
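A minimal sketch of the index expansion in this example follows; the parameter names are hypothetical and only the fix8 case described above is shown.

# Illustrative: expand a 1-bit-per-element index into the structure's index layout,
# where every 1 bit of the structure's index corresponds to 4 bits of data.
def expand_index_for_structure(index_bits, elem_bits=8, bits_per_index_bit=4):
    repeat = elem_bits // bits_per_index_bit          # 2 for fix8 data
    return [b for b in index_bits for _ in range(repeat)]

# 0011 with fix8 data becomes 00001111, matching the example above
print(expand_index_for_structure([0, 0, 1, 1]))       # [0, 0, 0, 0, 1, 1, 1, 1]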
In yet another embodiment, the sparse instruction may indicate a sixth mode of operation. The sixth operation mode is used to perform anti-sparsification processing, that is, to restore the data after sparsification to the data format or scale before sparsification.
In a sixth mode of operation, the operands of the sparse instruction include a thinned data portion and a corresponding index portion, which are each in a dense storage format, but are not bound. The input data portion is aligned by n elements. For example, in the example where m is 4 and n is 2, the input data portion is aligned by 2 elements, and the output data portion is aligned by 4 elements. The index portion indicates an original position of the data portion in the data before the thinning-out process, wherein each 1 bit of the index portion corresponds to one data element.
At this time, the arithmetic circuit 830 may be configured to perform the anti-thinning processing on the input data portion in accordance with the position indicated by the input index portion in accordance with the thinning instruction to generate the restored data having the data format before the thinning processing, and output the restored data to the storage circuit.
In one implementation, the anti-sparsification process may include: according to the position indicated by the index part, according to the data format before the thinning processing, the data elements in the data part are respectively placed at the corresponding positions of the data format before the thinning processing, and the rest positions of the data format are filled with predetermined information (for example, 0) to generate the recovery data.
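For illustration, the anti-sparsification step can be sketched as follows, assuming one index bit per original position and 0 as the predetermined fill information; the names are hypothetical.

# Illustrative: restore the pre-sparsification data format from a data portion
# and its index portion, padding unused positions with 0.
def desparsify(data_part, index_bits, fill=0):
    it = iter(data_part)
    return [next(it) if b else fill for b in index_bits]

# Data [5, -3] with index 0110 recovers [0, 5, -3, 0]
print(desparsify([5, -3], [0, 1, 1, 0]))              # [0, 5, -3, 0]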
From the foregoing description, it can be seen that the disclosed embodiments provide a sparse instruction for performing operations related to structured sparsity. These operations may include forward structured sparsification operations, and may also include anti-sparsification operations, and may also include some associated format conversion operations. In some embodiments, an operating mode bit may be included in the sparse instruction to indicate a different operating mode of the sparse instruction to perform different operations. In other embodiments, multiple sparse instructions may be provided directly, each corresponding to one or more different modes of operation, to perform various operations related to structured sparsity. By providing specialized sparse instructions to perform operations related to structured sparsity, processing may be simplified, thereby increasing the processing efficiency of the machine.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in that the acts or modules involved are not necessarily required for the implementation of the solution or solutions of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned Memory circuit or Memory device may be any suitable Memory medium (including magnetic Memory medium or magneto-optical Memory medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause 1, a data processing apparatus, comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
a tensor interface circuit configured to parse the descriptors;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
Clause 2, the data processing apparatus according to clause 1, wherein,
the tensor interface circuit is configured to determine a data address of tensor data corresponding to the operand in a data storage space according to the shape information; and/or
The tensor interface circuit is configured to determine a dependency relationship between instructions according to the spatial information.
Clause 3, the data processing apparatus according to any one of clauses 1 to 2, wherein the shape information of the tensor data includes at least one shape parameter indicating a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data includes at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
Clause 4, the data processing apparatus according to any of clauses 1-2, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data including a plurality of data blocks, N being a positive integer, the shape parameter including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
Clause 5, the data processing apparatus of any of clauses 1-4, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction includes data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output a thinned structure to the storage circuit, where the structure includes a data portion and an index portion, the data portion includes the data obtained after the data to be thinned is thinned, and the index portion is used to indicate the position of the thinned data in the data to be thinned.
Clause 6, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to execute structured sparse processing on the data to be thinned according to the sparse instruction, and output a thinned data portion to the storage circuit, where the data portion includes the data of the data to be thinned after the thinning processing.
Clause 7, the data processing apparatus of any of clauses 1-4, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output an index part subjected to sparse processing to the storage circuit, wherein the index part indicates the position of the data subjected to sparse processing in the data to be thinned.
Clause 8, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction include data to be thinned out and a sparse index indicating a location of a valid data element in a structured sparse to be performed,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction and the position indicated by the sparse index, and output a thinned structure or a thinned data part to the storage circuit, where the structure includes a data part and an index part that are bound to each other, the data part includes the data obtained after the data to be thinned is thinned, and the index part is used to indicate the position of the thinned data in the data to be thinned.
Clause 9, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsification,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
Clause 10, the data processing apparatus according to any of clauses 1-4, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction include a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning-out,
the arithmetic circuit is configured to perform, according to the sparse instruction and according to the position indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data in a data format before sparsification processing, and output the recovered data to the storage circuit.
Clause 11, the data processing apparatus according to any of clauses 5-8, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
Clause 12, the data processing apparatus of clause 11, wherein the arithmetic circuitry further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 13, the data processing apparatus of clause 12, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value operators for respectively taking absolute values of the m data elements to be thinned, so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, which is used for permutation and combination of the m absolute values to generate m groups of data, wherein each group of data comprises the m absolute values and the positions of the m absolute values in each group of data are different from each other;
the third pipeline stage comprises m paths of comparison circuits for comparing absolute values in the m groups of data and generating comparison results; and
the fourth pipeline stage comprises a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements in the m data elements.
Clause 14, the data processing apparatus according to clause 13, wherein each of the comparison circuits in the third pipeline stage includes m-1 comparators, the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith group of data with the other three absolute values and generate comparison results, and i is greater than or equal to 1 and less than or equal to m.
Clause 15, the data processing apparatus according to any one of clauses 13-14, wherein the filtering circuitry is further configured to select according to a specified priority order when there are data elements that are the same in absolute value.
Clause 16, the data processing apparatus of clause 10, wherein the anti-sparsification process comprises:
according to the position indicated by the index part, according to the data format before sparsifying, each data element in the data part is respectively placed at the corresponding position of the data format before sparsifying, and the rest positions of the data format are filled with predetermined information to generate the recovery data.
Clause 17, the data processing apparatus according to clause 5, 8 or 9, wherein
Each 1 bit in the index part in the structure body corresponds to the position of N bits of data, and N is determined at least partially based on hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
Clause 18, the data processing apparatus of any of clauses 1-17, wherein the sparseness instruction is for structured sparseness processing of at least one dimension of the multidimensional data in the neural network.
Clause 19, the data processing apparatus of clause 18, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
Clause 20, the data processing apparatus according to any of clauses 1-19, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
Clause 21, a chip comprising the data processing apparatus of any of clauses 1-20.
Clause 22, a board comprising the chip of clause 21.
Clause 23, a data processing method, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
parsing the descriptor;
reading a corresponding operand based at least in part on the parsed descriptor;
performing the structured sparsity-related operation on the operands; and
and outputting the operation result.
Clause 24, the data processing method of clause 23, wherein parsing the descriptor comprises:
according to the shape information, determining the data address of tensor data corresponding to the operand in a data storage space; and/or
And determining the dependency relationship between the instructions according to the spatial information.
Clause 25, the data processing method according to any of clauses 23-24, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data includes at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
Clause 26, the data processing method according to any of clauses 23-24, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
Clause 27, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure, wherein the structure comprises a data part and an index part which are bound to each other, the data part comprises the data obtained after the data to be sparsified is sparsified, and the index part is used for indicating the position of the sparsified data in the data to be sparsified.
Clause 28, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
and outputting a data part after the sparsification treatment, wherein the data part comprises the data of the data to be sparsified after the sparsification treatment.
Clause 29, the method of data processing according to any of clauses 23-26, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a thinned index part, wherein the index part indicates the position of the thinned data in the data to be thinned.
Clause 30, the data processing method according to any of clauses 23-26, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction include data to be thinned out and a sparse index indicating a location of a valid data element in a structured sparse to be performed, the method further comprising:
according to the sparse instruction and the position indicated by the sparse index, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structural body or a sparsified data part, wherein the structural body comprises a data part and an index part which are bound to each other, the data part comprises the data obtained after the data to be sparsified is sparsified, and the index part is used for indicating the position of the sparsified data in the data to be sparsified.
Clause 31, the data processing method according to clauses 23-26, wherein the sparse instruction indicates a fifth mode of operation, and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying, the method further comprising:
binding the data part and the index part into a structure according to the sparse instruction; and
and outputting the structural body.
Clause 32, the method of data processing according to clauses 23-26, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying, the method further comprising:
according to the sparse instruction, according to the position indicated by the index part, performing anti-sparsification processing on the data part to generate recovery data with a data format before sparsification processing; and
and outputting the recovered data.
Clause 33, the data processing method according to any one of clauses 27-30, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
Clause 34, the data processing method of clause 33, wherein the structured sparsification is implemented using an arithmetic circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 35, the data processing method of clause 34, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value operators for respectively taking absolute values of the m data elements to be thinned, so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, which is used for permutation and combination of the m absolute values to generate m groups of data, wherein each group of data comprises the m absolute values and the positions of the m absolute values in each group of data are different from each other;
the third pipeline stage comprises m paths of comparison circuits for comparing absolute values in the m groups of data and generating comparison results; and
the fourth pipeline stage comprises a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements in the m data elements.
Clause 36, the data processing method according to clause 35, wherein each of the comparison circuits in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith group of data with the other three absolute values and generate comparison results, where i is greater than or equal to 1 and less than or equal to m.
Clause 37, the data processing method according to any of clauses 35-36, wherein the filtering circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.
Clause 38, the data processing method of clause 32, wherein the anti-sparsification process comprises:
according to the position indicated by the index part, according to the data format before sparsifying, each data element in the data part is respectively placed at the corresponding position of the data format before sparsifying, and the rest positions of the data format are filled with predetermined information to generate the recovery data.
Clause 39, the data processing method of clause 27, 30 or 31, wherein:
each 1 bit in the index part in the structure body corresponds to the position of N bits of data, and N is determined at least partially based on hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
Clause 40, the data processing method of any of clauses 23-39, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
Clause 41, the data processing method of clause 40, wherein the at least one dimension is selected from the group consisting of an input channel dimension and an output channel dimension.
Clause 42, the data processing method according to any of clauses 23-41, wherein
The sparse instruction includes an operation mode bit to indicate an operation mode of the sparse instruction, or the sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operation modes.
The foregoing detailed description of the disclosed embodiments has been presented to enable one of ordinary skill in the art to make and use the principles and implementations of the present disclosure; meanwhile, for the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the present disclosure should not be construed as limiting the present disclosure.

Claims (42)

1. A data processing apparatus comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data;
a tensor interface circuit configured to parse the descriptors;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction based on the parsed descriptor.
2. The data processing apparatus according to claim 1,
the tensor interface circuit is configured to determine a data address of tensor data corresponding to the operand in a data storage space according to the shape information; and/or
The tensor interface circuit is configured to determine a dependency relationship between instructions according to the spatial information.
3. The data processing apparatus according to any one of claims 1 to 2, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
4. The data processing apparatus of any of claims 1-2, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter comprising at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
5. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a first mode of operation and operands of the sparse instruction comprise data to be thinned out,
the arithmetic circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and to output a thinned structure to the storage circuit, where the structure includes a data portion and an index portion, the data portion includes the thinned-out data of the data to be thinned, and the index portion indicates the position of the thinned-out data in the data to be thinned.
6. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a second mode of operation and operands of the sparse instruction comprise data to be thinned out,
the arithmetic circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and to output a thinned data portion to the storage circuit, where the data portion includes the thinned-out data of the data to be thinned.
7. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a third mode of operation and operands of the sparse instruction comprise data to be thinned out,
the arithmetic circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and to output a thinned index portion to the storage circuit, where the index portion indicates the position of the thinned-out data in the data to be thinned.
8. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction comprise data to be thinned out and a sparse index indicating a position of a valid data element in the structured sparsification to be performed,
the arithmetic circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction and the positions indicated by the sparse index, and to output a thinned structure or a thinned data portion to the storage circuit, where the structure includes a data portion and an index portion that are bound to each other, the data portion includes the thinned-out data of the data to be thinned, and the index portion indicates the position of the thinned-out data in the data to be thinned.
9. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a fifth mode of operation, and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning out,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
10. The data processing apparatus according to any of claims 1-4, wherein the sparse instruction indicates a sixth mode of operation, and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning out,
the arithmetic circuit is configured to perform, according to the sparse instruction and according to the position indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data having a data format before sparsification processing, and output the recovered data to the storage circuit.
11. The data processing apparatus according to any of claims 5-8, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, where m > n.
12. The data processing apparatus of claim 11, wherein the arithmetic circuit further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform, according to the sparse instruction, structured sparse processing that selects, from among the m data elements, the n data elements having larger absolute values as valid data elements.
13. The data processing apparatus of claim 12, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value calculation units, configured to take the absolute values of the m data elements to be thinned respectively, so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, configured to permute and combine the m absolute values to generate m groups of data, wherein each group of data contains the m absolute values and the arrangements of the m absolute values differ from group to group;
the third pipeline stage comprises m parallel comparison circuits, configured to compare the absolute values in the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select, according to the comparison results, the n data elements having larger absolute values as valid data elements, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
14. The data processing apparatus of claim 13, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other m-1 absolute values and generate comparison results, where 1 ≤ i ≤ m.
15. A data processing apparatus according to any of claims 13 to 14, wherein the screening circuit is further configured to select in a specified priority order when there are data elements of the same absolute value.
16. The data processing apparatus of claim 10, wherein the anti-sparsification processing comprises:
according to the positions indicated by the index portion, each data element in the data portion is placed at its corresponding position in the data format used before the sparsification processing, and the remaining positions of that data format are filled with predetermined information to generate the recovered data.
17. A data processing apparatus as claimed in claim 5, 8 or 9, wherein
each bit in the index portion of the structure corresponds to the position of N bits of data, where N is determined at least in part based on the hardware configuration; and/or
the data portion in the structure is aligned according to a first alignment requirement, and the index portion in the structure is aligned according to a second alignment requirement.
18. The data processing apparatus according to any of claims 1-17, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
19. The data processing apparatus of claim 18, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
20. A data processing apparatus as claimed in any one of claims 1 to 19, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instructions include a plurality of instructions, each instruction corresponding to one or more different operating modes.
21. A chip comprising a data processing device according to any one of claims 1 to 20.
22. A board card comprising the chip of claim 21.
23. A method of data processing, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity and at least one operand of the sparse instruction including at least one descriptor indicating at least one of: shape information of tensor data and spatial information of tensor data;
parsing the descriptor;
reading a corresponding operand based at least in part on the parsed descriptor;
performing the structured sparsity-related operation on the operands; and
and outputting the operation result.
24. The data processing method of claim 23, wherein parsing the descriptor comprises:
according to the shape information, determining the data address of tensor data corresponding to the operand in a data storage space; and/or
And determining the dependency relationship between the instructions according to the spatial information.
25. The data processing method of any of claims 23-24, wherein the shape information of the tensor data includes at least one shape parameter representing a shape of the N-dimensional tensor data, N being a positive integer, the shape parameter of the tensor data including at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimensional directions, the size of a storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area in at least one of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to a data reference point, and the mapping relation between the data description position of the tensor data and a data address.
26. The data processing method of any of claims 23-24, wherein the shape information of the tensor data indicates at least one shape parameter of a shape of N-dimensional tensor data comprising a plurality of data blocks, N being a positive integer, the shape parameter comprising at least one of:
the size of a data storage space where the tensor data are located in at least one of N dimension directions, the size of a storage area of a single data block in at least one of the N dimension directions, the block step size of the data block in at least one of the N dimension directions, the number of the data blocks in at least one of the N dimension directions, and the overall step size of the data block in at least one of the N dimension directions.
27. The data processing method of any of claims 23 to 26, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure, wherein the structure comprises a data portion and an index portion which are bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
28. The data processing method of any of claims 23 to 26, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a data portion after the sparsification processing, wherein the data portion comprises the sparsified data of the data to be sparsified.
29. The data processing method of any of claims 23 to 26, wherein the sparse instruction indicates a third mode of operation and operands of the sparse instruction comprise data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a thinned index part, wherein the index part indicates the position of the thinned data in the data to be thinned.
30. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction comprise data to be thinned out and a sparse index indicating a position of a valid data element in the structured sparsification to be performed, the method further comprising:
according to the sparse instruction and the position indicated by the sparse index, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure or a sparsified data portion, wherein the structure comprises a data portion and an index portion which are bound to each other, the data portion comprises the sparsified data of the data to be sparsified, and the index portion indicates the position of the sparsified data in the data to be sparsified.
31. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning out, the method further comprising:
binding the data part and the index part into a structure according to the sparse instruction; and
and outputting the structural body.
32. The data processing method according to any of claims 23-26, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning out, the method further comprising:
according to the sparse instruction, according to the position indicated by the index part, performing anti-sparsification processing on the data part to generate recovery data with a data format before sparsification processing; and
and outputting the recovered data.
33. The data processing method according to any of claims 27-30, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, where m > n.
34. The data processing method of claim 33, wherein the structured sparse processing is implemented using an arithmetic circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform, according to the sparse instruction, structured sparse processing that selects, from among the m data elements, the n data elements having larger absolute values as valid data elements.
35. The data processing method of claim 34, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute value calculation units, configured to take the absolute values of the m data elements to be thinned respectively, so as to generate m absolute values;
the second pipeline stage comprises a permutation and combination circuit, configured to permute and combine the m absolute values to generate m groups of data, wherein each group of data contains the m absolute values and the arrangements of the m absolute values differ from group to group;
the third pipeline stage comprises m parallel comparison circuits, configured to compare the absolute values in the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select, according to the comparison results, the n data elements having larger absolute values as valid data elements, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
36. The data processing method of claim 35, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other m-1 absolute values and generate comparison results, where 1 ≤ i ≤ m.
37. A data processing method according to any of claims 35 to 36, wherein the screening circuit is further configured to select in a specified priority order when there are data elements of the same absolute value.
38. The data processing method of claim 32, wherein the anti-sparsification process comprises:
according to the positions indicated by the index portion, each data element in the data portion is placed at its corresponding position in the data format used before the sparsification processing, and the remaining positions of that data format are filled with predetermined information to generate the recovered data.
39. The data processing method of claim 27, 30 or 31, wherein:
each bit in the index portion of the structure corresponds to the position of N bits of data, where N is determined at least in part based on the hardware configuration; and/or
the data portion in the structure is aligned according to a first alignment requirement, and the index portion in the structure is aligned according to a second alignment requirement.
40. The data processing method of any of claims 23 to 39, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
41. The data processing method of claim 40, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
42. A data processing method as claimed in any one of claims 23 to 41, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
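As an informal, non-limiting illustration of the shape parameters recited in claims 3 and 25 (the size of the data storage space in each dimension direction, the offset of the storage area, and the mapping relation between a data description position and a data address), the following Python sketch computes a data address under assumed conditions: a row-major layout, a unit element size and an arbitrary base address. Neither these choices nor the function and parameter names are taken from the claims.

def descriptor_address(base, space_sizes, region_offset, coords):
    """Map a data description position of tensor data to a data address
    (sketch; assumes a row-major layout and a unit element size).

    base          - base address of the data storage space;
    space_sizes   - size of the data storage space in each of N dimensions;
    region_offset - offset of the storage area in each dimension;
    coords        - position of the element within the storage area.
    """
    addr, stride = base, 1
    # Walk the dimensions from the innermost (last) to the outermost (first).
    for size, off, c in zip(reversed(space_sizes),
                            reversed(region_offset),
                            reversed(coords)):
        addr += (off + c) * stride
        stride *= size
    return addr

# Example: an 8 x 16 data storage space, storage area offset (2, 4);
# address of the element at position (1, 3) inside the storage area is
# 0x1000 + (4 + 3) * 1 + (2 + 1) * 16 = 0x1037.
print(hex(descriptor_address(0x1000, [8, 16], [2, 4], [1, 3])))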
CN202011563257.XA 2020-12-25 2020-12-25 Data processing device, data processing method and related product Pending CN114692841A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011563257.XA CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product
PCT/CN2021/128189 WO2022134873A1 (en) 2020-12-25 2021-11-02 Data processing device, data processing method, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011563257.XA CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product

Publications (1)

Publication Number Publication Date
CN114692841A true CN114692841A (en) 2022-07-01

Family

ID=82130877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011563257.XA Pending CN114692841A (en) 2020-12-25 2020-12-25 Data processing device, data processing method and related product

Country Status (1)

Country Link
CN (1) CN114692841A (en)

Similar Documents

Publication Publication Date Title
WO2023045445A1 (en) Data processing device, data processing method, and related product
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN114692844A (en) Data processing device, data processing method and related product
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114692841A (en) Data processing device, data processing method and related product
WO2022134873A1 (en) Data processing device, data processing method, and related product
CN114281561A (en) Processing unit, synchronization method for a processing unit and corresponding product
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
CN113867799A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114691353A (en) Tensor reading method and device and related product
CN114692845A (en) Data processing device, data processing method and related product
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
WO2022134688A1 (en) Data processing circuit, data processing method, and related products
CN114692838A (en) Data processing device, data processing method and related product
WO2022135599A1 (en) Device, board and method for merging branch structures, and readable storage medium
WO2022257980A1 (en) Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product
CN114692840A (en) Data processing device, data processing method and related product
CN115221104A (en) Data processing device, data processing method and related product
CN114692846A (en) Data processing device, data processing method and related product
CN114692839A (en) Data processing device, data processing method and related product
WO2022135600A1 (en) Computational neural network apparatus, card, method, and readable storage medium
CN113867792A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination